
















VOLUME 42, PARTS 3 AND 4          DECEMBER 1955





POPULATION ESTIMATION BASED ON CHANGE OF COMPOSITION 
CAUSED BY A SELECTIVE REMOVAL 


By DOUGLAS G. CHAPMAN* 


Department of Mathematics, University of Washington and Readership in the 
Design and Analysis of Scientific Experiment, Oxford University 


1. INTRODUCTION 


It has been noted in wildlife studies that it is possible to estimate the size of an animal 
population on the basis of the change of sex ratio following a kill concentrated on one sex. 
It is difficult to ascertain to whom this idea should be credited, and what name should be 
given to it. Scattergood (1954), in a general survey, lists several references to the method. 
Chapman (1954) called the procedure the dichotomy method of estimation and showed that 
the usual ‘intuitive’ estimates are also maximum-likelihood estimates. It is not clear that 
this title is sufficiently descriptive, and in any case the classification does not necessarily 
need to be dichotomous. On the other hand, change of composition does seem to reflect the 
essential aspect of the procedure. 

The assumptions involved in this method appear to be simpler and more likely to be 
fulfilled than those underlying other population estimation procedures. Moreover, it will 
be shown that non-fulfilment of some of the assumptions may involve no serious
consequences. This is one of the phases of the method studied in this paper.

To be balanced against this is the fact that the amount of information yielded is smaller 
for the same effort than is obtainable from the more usual method of estimation based on 
the recapture of marked members of the population. A quantitative comparison is given 
below of the two methods. Some problems of optimum design in the sampling that accom- 
panies a change of composition estimate are considered and some extensions of the pro- 
cedure studied. 


2. BINOMIAL MODEL 


The simplest situation arises if a ‘closed’ population is assumed, i.e. one where there is 
no emigration or immigration and births and deaths are negligible over the period of the 
experiment—except for the removal process. As to the removal process, all that is required 
of it is that the final or cumulative total be known and a breakdown into at least two classes
or components. The third assumption for the simple binomial model is that random samples 
have been taken of the population before and after the removal process. The randomness 
is in respect to the specified classes or components. 
The following notation is needed: 


Unknown parameters:
    $N_i$ = population size at time $t_i$ ($i = 0, 1$), made up of two classes $X$ and $Y$,
    $X_i$, $Y_i$ = size of classes $X$ and $Y$ at times $t_i$.


* Research initiated under sponsorship of the Office of Naval Research. The author is currently 
a Fellow of the Guggenheim Foundation. 


It is convenient to write $P_i = X_i/N_i$ ($i = 0, 1$).

Known parameters:
    $n_i$ = size of random samples taken at times $t_i$,
    $R_x = X_0 - X_1$ = removal from class $X$,
    $R_y = Y_0 - Y_1$ = removal from class $Y$,
    $R = R_x + R_y$ = total removal.

Random variables:
    $x_i$, $y_i$ = the number of members of classes $X$ and $Y$ respectively in sample $i$ ($i = 0, 1$).

Analogous to $P_i$, we define $p_i = x_i/n_i$ ($i = 0, 1$).
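As noted in Chapman (1954) the maximum-likelihood estimates of $X_0$ and $N_0$ are

$$\hat X_0 = \frac{x_0(n_1R_x - x_1R)}{n_1x_0 - n_0x_1} = \frac{p_0(R_x - p_1R)}{p_0 - p_1}, \tag{1}$$

$$\hat N_0 = \frac{n_0(n_1R_x - x_1R)}{n_1x_0 - n_0x_1} = \frac{R_x - p_1R}{p_0 - p_1}, \tag{2}$$

while the asymptotic variances of $\hat X_0$ and $\hat N_0$ are

$$\sigma^2(\hat X_0) = (P_0 - P_1)^{-2}\left[\frac{P_1^2X_0Y_0}{n_0} + \frac{P_0^2X_1Y_1}{n_1}\right], \tag{3}$$

$$\sigma^2(\hat N_0) = (P_0 - P_1)^{-2}\left[\frac{X_0Y_0}{n_0} + \frac{X_1Y_1}{n_1}\right]. \tag{4}$$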
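As an illustration only, estimates (1) and (2) and plug-in versions of the variances (3) and (4) can be computed as in the following sketch. The function and field names are hypothetical, and replacing the unknown $X_i$, $Y_i$, $P_i$ in (3) and (4) by their estimates is an assumption of the sketch rather than a step stated above.

```python
from dataclasses import dataclass


@dataclass
class CompositionEstimate:
    x0_hat: float   # estimate of X_0, formula (1)
    n0_hat: float   # estimate of N_0, formula (2)
    var_x0: float   # plug-in version of (3)
    var_n0: float   # plug-in version of (4)


def change_of_composition(n0, x0, n1, x1, r_x, r_y):
    """Estimate X_0 and N_0 from two classified samples and a known removal.

    n0, n1 : sizes of the samples taken before and after the removal
    x0, x1 : number of class-X members in each sample
    r_x, r_y : known removals from classes X and Y
    """
    r = r_x + r_y
    p0 = x0 / n0
    p1 = x1 / n1
    if p0 == p1:
        raise ValueError("no change of composition: p0 equals p1")

    n0_hat = (r_x - p1 * r) / (p0 - p1)   # formula (2)
    x0_hat = p0 * n0_hat                  # formula (1)

    # Plug-in class sizes before and after the removal (assumption of the sketch).
    y0_hat = n0_hat - x0_hat
    x1_hat = x0_hat - r_x
    y1_hat = y0_hat - r_y

    scale = (p0 - p1) ** -2
    var_n0 = scale * (x0_hat * y0_hat / n0 + x1_hat * y1_hat / n1)
    var_x0 = scale * (p1 ** 2 * x0_hat * y0_hat / n0
                      + p0 ** 2 * x1_hat * y1_hat / n1)
    return CompositionEstimate(x0_hat, n0_hat, var_x0, var_n0)


# Hypothetical data: 40 of 100 animals are class X before the removal,
# 25 of 100 afterwards, and 200 class-X animals were removed.
estimate = change_of_composition(n0=100, x0=40, n1=100, x1=25, r_x=200, r_y=0)
print(estimate)
```

With these made-up figures the estimates are close to $\hat N_0 = 1000$ and $\hat X_0 = 400$, with a large-sample standard error for $\hat N_0$ of about 400, which shows how little information two samples of 100 carry.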


In the usage made of this method in wildlife studies it is $N_0$ that is ordinarily estimated,
based, as noted above, on the change of sex ratio. The sports kill records show the removal
of each sex. In many cases the kill is confined to males; this represents a very favourable
situation. It should be pointed out that under the assumptions made it is immaterial
whether attention is focused on time $t_0$ or time $t_1$.

In the possible application of this procedure to the estimation of fish populations, it 
must be recognized the removal is not always selective by sex. However, here the classi- 
fication may be by size or by species. In many fisheries there is a size restriction on fish 
that may be retained by the fishery. If the biologist can sample in a manner to include 
randomly other sizes, then this change of composition estimation procedure may be used. 
In this case, or in case the classification is by species, the experimenter will usually be more 
interested in estimating $X_0$.

It may be suggested that where X and Y represent qualitatively different populations, 
one should expect different catchabilities and consequently the procedure based on such a 
classification would be unsatisfactory. However, in the rather important case that $R_y = 0$,
the estimate $\hat X_0$ remains asymptotically unbiased even if the catchability of the $X$ population
is not the same as that of the $Y$ population (for the experimenter or sampler). We repeat
that the catchabilities associated with the removal process are irrelevant.

For consider that $X$ and $Y$ have catchabilities $\lambda_x$ and $\lambda_y$ and assume the total sample
catch has a Poisson distribution. Then given $n_i$, the sample catch size, $x_i$ has a binomial
distribution with parameters $n_i$, $X_i/(N_i - Y_i\delta)$, where $\delta = 1 - \lambda_y/\lambda_x$.







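Then replacing the random variables in $\hat X_0$ by their expectations (to which they will converge in probability as the sample sizes are increased), we have

$$\hat X_0 \to \frac{X_0(N_0R_x - X_0R - \delta Y_0R_x + \delta R_xR_y)}{N_0R_x - X_0R - \delta Y_0R_x + \delta X_0R_y}, \tag{5}$$

and the result stated follows immediately, since for $R_y = 0$ the bracketed term in the numerator and the denominator coincide.
However, $\hat N_0$ does not have the same desirable property, for replacing $x_i$, $y_i$ by their expectations under the model of unequal catchabilities gives

$$\hat N_0 \to \frac{(N_0 - Y_0\delta)(N_0R_x - X_0R - \delta Y_0R_x + \delta R_xR_y)}{N_0R_x - X_0R - \delta Y_0R_x + \delta X_0R_y}. \tag{6}$$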
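The limiting values under unequal catchabilities can be checked numerically. The sketch below is illustrative only: it substitutes the binomial parameters $X_i/(N_i - Y_i\delta)$ for $p_0$, $p_1$ in (1) and (2), and the function name and population figures are hypothetical.

```python
def limiting_estimates(x0, y0, r_x, r_y, delta):
    """Large-sample limits of estimates (1) and (2) under unequal catchability.

    delta = 1 - lambda_y / lambda_x, so delta = 0 means equal catchabilities.
    Returns (limit of X_0 estimate, limit of N_0 estimate).
    """
    n0 = x0 + y0
    x1, y1 = x0 - r_x, y0 - r_y
    n1 = x1 + y1
    r = r_x + r_y

    # Probability that a sampled animal is of class X, before and after removal.
    pi0 = x0 / (n0 - y0 * delta)
    pi1 = x1 / (n1 - y1 * delta)

    n_limit = (r_x - pi1 * r) / (pi0 - pi1)
    x_limit = pi0 * n_limit
    return x_limit, n_limit


# Hypothetical population: X_0 = 400, Y_0 = 600, class Y half as catchable as X.
print(limiting_estimates(400, 600, r_x=200, r_y=0, delta=0.5))   # removal from X only
print(limiting_estimates(400, 600, r_x=200, r_y=50, delta=0.5))  # some removal from Y
```

In the first call the limit of the $X_0$ estimate equals the true 400 while the $N_0$ estimate tends to $N_0 - Y_0\delta = 700$; in the second call, where $R_y \neq 0$, the $X_0$ estimate is no longer asymptotically unbiased.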


Nevertheless, the fact that $X_0$ is still estimable indicates that it may be possible to use a quite different species in this estimation procedure, e.g. a scrap fish may form the $Y$ class while the desirable sports fish is the $X$ class. It should be pointed out that the distribution of the estimate will be modified in case $\lambda_x \neq \lambda_y$, even though $R_y = 0$. In particular, in this case the asymptotic variance of $\hat X_0$ is

$$\sigma^2(\hat X_0(\delta)) = (P_0 - P_1)^{-2}\left[\frac{P_1^2X_0Y_0}{n_0}\,\frac{1-\delta}{(1-P_0\delta)^2} + \frac{P_0^2X_1Y_1}{n_1}\,\frac{1-\delta}{(1-P_1\delta)^2}\right]. \tag{7}$$


The variation of $\sigma^2(\hat X_0)$ with respect to $\delta$ depends on all of the unknown parameters.
However, it is possible to state some useful qualitative results. These can be determined
simply from a study of the function

$$f(\delta) = (1-\delta)/(1-P_i\delta)^2 \qquad (\delta < 1),$$

which attains its maximum at $\delta = 2 - 1/P_i$. This maximum is $[4P_i(1-P_i)]^{-1} \ge 1$.
The qualitative results are summarized in the following table: 








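                     $\delta > 0$ ($\lambda_x > \lambda_y$)                          $\delta < 0$ ($\lambda_x < \lambda_y$)
$P_0, P_1 < \tfrac12$    $\sigma^2(\hat X_0(\delta)) < \sigma^2(\hat X_0(0))$    $\sigma^2(\hat X_0(\delta)) > \sigma^2(\hat X_0(0))$
$P_0, P_1 > \tfrac12$    $\sigma^2(\hat X_0(\delta)) > \sigma^2(\hat X_0(0))$    $\sigma^2(\hat X_0(\delta)) < \sigma^2(\hat X_0(0))$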











The favourable cases occur in the main diagonal; they are favourable in the sense that if
$\sigma^2(\hat X_0(0))$ is used to estimate the variance or determine a confidence interval, any errors
incurred will be on the conservative side. In terms of a working rule for the sample considered,
if the undesirable fish are abundant relative to the desirable fish, then the procedure
is more satisfactory if the sampling device is such that the desirable fish are more catchable.
Since $f(\delta)$ achieves a maximum value of unity at $\delta = 0$ for $P_i = \tfrac12$, it seems reasonable to
suggest that when the $P_i$ straddle $\tfrac12$ and do not differ greatly from it, $\sigma^2(\hat X_0(\delta))$ will differ
little from $\sigma^2(\hat X_0(0))$.
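The behaviour of $f(\delta)$ described above is easy to verify numerically. The following is a minimal sketch with hypothetical function names; it checks the location and height of the maximum for one value of $P_i$ and the direction of the effect for $\delta > 0$.

```python
def f(delta, p):
    """The function f(delta) = (1 - delta) / (1 - p * delta)**2, for delta < 1."""
    return (1 - delta) / (1 - p * delta) ** 2


def maximizing_delta(p):
    """Value of delta at which f attains its maximum, 2 - 1/p."""
    return 2 - 1 / p


p = 0.25                       # hypothetical proportion of the X class
d_star = maximizing_delta(p)   # -2.0 for p = 0.25
print(d_star, f(d_star, p), 1 / (4 * p * (1 - p)))

# For p < 1/2 and delta > 0 the factor is below f(0) = 1, i.e. the
# equal-catchability variance errs on the conservative side.
print(f(0.5, p) < f(0.0, p))
```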

One other possibility of error in this procedure arises from the fact that $R_x$, $R_y$ and $R$
may not be true totals removed from the population. In general, it is to be expected that 
the removal will be under-evaluated because of the unreported kill as well as the unknown 
natural mortality. 




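Suppose that the actual removals are $R_x + \Delta_x$, $R_y + \Delta_y$, $R + \Delta$, where $\Delta_x + \Delta_y = \Delta$ and
where we may write $\Delta_x = KR_x$, $\Delta = KR + \epsilon$.
Then for $n_i \to \infty$, the estimate $\hat N_0$ converges in probability to

$$N_0\left[\frac{N_0R_x - X_0R - \epsilon R_x}{(N_0R_x - X_0R)(1+K) - \epsilon X_0}\right]. \tag{8}$$

It is convenient to denote the term in the brackets of (8) by $1 + b$. $\hat X_0$ also converges in
probability, under these conditions, to $X_0(1 + b)$.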


Consider first the case where $\epsilon = 0$, i.e. the unknown removal of $X$ and $Y$ is in the same
proportion as the known removal. Then

$$1 + b = (1+K)^{-1} \approx 1 - K \quad \text{for } K \text{ small};$$

hence the relative asymptotic bias is negative and of the order of the unknown proportion
of the removal.
For the more important case where $\epsilon$ is not zero we restrict attention to the situation
where $R_y = 0$. Then

$$1 + b = \frac{Y_0 - \epsilon}{Y_0(1+K) - \epsilon X_0/R}.$$

If the mortality or unknown removal of the $X$'s and $Y$'s is proportional to the size of these
groups, then

$$\Delta_y = Y_0\Delta_x/X_0 = Y_0KR/X_0$$

and $\Delta = KRN_0/X_0$, $\epsilon = Y_0KR/X_0$.

Hence $b = -KR/X_0$.

In this situation the relative bias is still negative and of an even smaller order than in the
case where $\epsilon = 0$.
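The two bias results can be checked against a direct calculation. The sketch below is illustrative and uses hypothetical names and figures: `relative_bias` evaluates the bracket in (8), and `limiting_n0_hat` recomputes the large-sample limit of (2) from the true proportions when part of the removal is unreported.

```python
def relative_bias(x0, y0, r_x, r_y, k, eps):
    """Relative asymptotic bias b, where 1 + b is the bracket in (8)."""
    n0 = x0 + y0
    r = r_x + r_y
    base = n0 * r_x - x0 * r
    return (base - eps * r_x) / (base * (1 + k) - eps * x0) - 1


def limiting_n0_hat(x0, y0, r_x, r_y, delta_x, delta_y):
    """Limit of estimate (2) when the true removals are r_x + delta_x and
    r_y + delta_y but only r_x and r_y are known to the experimenter."""
    n0 = x0 + y0
    x1 = x0 - r_x - delta_x
    n1 = n0 - r_x - r_y - delta_x - delta_y
    p0, p1 = x0 / n0, x1 / n1
    return (r_x - p1 * (r_x + r_y)) / (p0 - p1)


# Hypothetical case: X_0 = 400, Y_0 = 600, known removal of 100 from X only,
# unknown mortality proportional to class size with K = 0.1.
x0, y0, r, k = 400, 600, 100, 0.1
delta_x = k * r              # 10
delta_y = y0 * delta_x / x0  # 15, which is also epsilon because R_y = 0
print(relative_bias(x0, y0, r, 0, k, eps=delta_y))      # close to -K R / X_0 = -0.025
print(limiting_n0_hat(x0, y0, r, 0, delta_x, delta_y))  # close to 975
```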

In the usage of (4) (or (3)) to estimate the large sample variances, in a situation where 
part of the removal is unknown to the experimenter, errors will arise in the incorrect 
estimation of the $X_i$, $Y_i$. It is easily seen that, while on the average, for sufficiently large
samples, $X_0$, $Y_0$ are underestimated, $X_1$ and $Y_1$ will be overestimated and consequently the
estimation errors will compensate. From a qualitative point of view, therefore, it appears 
that the errors due to bias in the estimation of the variances are of much smaller con- 
sequence than the sampling errors. 

The cases considered above are by no means the only possibilities that might occur with 
a portion of the removal unknown to the experimenter. They represent, however, two cases 
of practical importance. Of course if no assumptions whatsoever are made as to the size 
and relative magnitudes of $\Delta_x$ and $\Delta_y$, it is clear that they may be such as to vitiate the
estimation procedure. 


3. OPTIMUM SAMPLE ALLOCATION 


Since this method of population estimation involves two samples over which the experimenter has some degree of control, it is natural to inquire as to the optimum disposition of effort between these samples. If $N_0$ is to be estimated with $\sigma^2(\hat N_0)$ to be minimized, subject to the restriction that $n_0 + n_1$ be fixed, then elementary calculus leads to the following equation for the ratio of $n_1$ to $n_0$:

$$\frac{n_1}{n_0} = \left(\frac{X_1Y_1}{X_0Y_0}\right)^{1/2}. \tag{9}$$







Formula (9) involves parameters which are unknown at the outset of the experiment.
It is possible to use it to suggest qualitative rules of procedure. Consider the most favourable
case where the removal is restricted entirely to one class, say the $X$ class, i.e. $R_y = 0$. Then

$$n_1 = n_0\left(1 - \frac{R}{X_0}\right)^{1/2};$$

the ratio $n_1/n_0$ is given in Table 1 for various values of $R/X_0$, the proportion of the $X$ class
removed. Since the removal will rarely exceed 50 % of the removed class, this table suggests
that if $N_0$ alone is to be estimated, $n_1$ should be chosen only slightly smaller than $n_0$.


Table 1. Optimum sample ratio $n_1/n_0$ for the estimation of $N_0$
in case there is no removal in the Y class

$R/X_0$      0.5    0.4    0.3    0.2    0.1    0.05    0.02    0.01
$n_1/n_0$    0.71   0.77   0.84   0.89   0.95   0.975   0.990   0.995


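If $X_0$ is to be estimated, the situation is slightly more complicated, but a qualitative rule
is possible. Recalling $\sigma^2(\hat X_0)$ from formula (3), this is minimized with $n_0 + n_1$ fixed if

$$\frac{n_1}{n_0} = \frac{P_0}{P_1}\left(\frac{X_1Y_1}{X_0Y_0}\right)^{1/2} = \frac{N_1}{N_0}\left(\frac{X_0Y_1}{X_1Y_0}\right)^{1/2}. \tag{10}$$

Again, consider the case $Y_1 = Y_0$ and hence $R_x = R$. Then, writing $r = R/N_0$, the optimum choice of $n_0$ and $n_1$ is determined by

$$\frac{n_1}{n_0} = (1 - r)\left(\frac{P_0}{P_0 - r}\right)^{1/2},$$

which is shown in Table 2.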


Table 2. Optimum sample ratio $n_1/n_0$ for the estimate of $X_0$
in case there is no removal in the Y class

                                      $R/N_0 = r$
$P_0 = X_0/N_0$    0.25     0.20     0.15     0.10     0.05     0.02     0.01
0.1                 -        -        -        -      1.344    1.096    1.043
0.2                 -        -      1.700    1.273    1.097    1.033    1.015
0.3               1.838    1.386    1.202    1.102    1.041    1.014    1.007
0.4               1.225    1.132    1.075    1.039    1.016    1.005    1.003
0.5               1.061    1.032    1.016    1.007    1.001    1.000    1.000
0.6               0.982    0.980    0.982    0.986    0.993    0.997    0.998
0.7               0.935    0.947    0.959    0.972    0.985    0.994    0.997
0.8               0.905    0.924    0.943    0.963    0.981    0.993    0.996
0.9               0.882    0.907    0.931    0.954    0.977    0.991    0.996
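The entries of Tables 1 and 2 follow from the two allocation rules for the case of no removal from the Y class. The following sketch, with hypothetical function names, evaluates $n_1/n_0 = (1 - R/X_0)^{1/2}$ for the estimation of $N_0$ and $n_1/n_0 = (1-r)[P_0/(P_0-r)]^{1/2}$ for the estimation of $X_0$; it assumes $r < P_0$, since the removal cannot exceed the X class.

```python
import math


def optimum_ratio_for_n(removed_fraction_of_x):
    """n1/n0 minimizing the variance of the N_0 estimate when R_y = 0.

    removed_fraction_of_x is R / X_0.
    """
    return math.sqrt(1.0 - removed_fraction_of_x)


def optimum_ratio_for_x(p0, r):
    """n1/n0 minimizing the variance of the X_0 estimate when R_y = 0.

    p0 = X_0 / N_0 and r = R / N_0; the rule requires 0 < r < p0 < 1.
    """
    if not 0 < r < p0 < 1:
        raise ValueError("need 0 < r < p0 < 1")
    return (1.0 - r) * math.sqrt(p0 / (p0 - r))


# One row of each table, for comparison with the printed values.
for frac in (0.5, 0.4, 0.3, 0.2, 0.1):
    print(frac, round(optimum_ratio_for_n(frac), 2))
for r in (0.25, 0.20, 0.15, 0.10, 0.05, 0.02, 0.01):
    print(r, round(optimum_ratio_for_x(0.5, r), 3))
```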
































Table 2 suggests that in a few extreme cases the second sample should be considerably
larger than the first for an optimum estimate of $X_0$; for most cases, however, the samples
should be chosen almost equal in size. Since often the biologist will wish to know both
$X_0$ and $N_0$ it would seem that choosing $n_0 = n_1$ represents a near optimum allocation of
effort.










4. COMPARISON OF CAPTURE-RECAPTURE AND CHANGE 
OF COMPOSITION ESTIMATION PROCEDURES 


Rather than merely sampling and classifying as $X$ or $Y$ $n_0 + n_1$ animals, the experimenter
may tag or mark an equivalent number. The removal process will then act as a sample to 
estimate the tag ratio. It is perhaps reasonable to disregard the additional effort necessary 
to record accurately the tag recoveries. However, it is not reasonable to disregard the fact 
that tagging may be an operation that is more expensive, or requires more effort than 
classification. 

In view of the result of the last section it is reasonable to consider the case $n_0 = n_1 = n$
(say). For simplicity we restrict consideration to $R_y = 0$, $R_x = R$. Denote $R/N_0$ by $r$. Then

$$\sigma^2(\hat N_0) = \frac{Y_0[X_0 + X_0 - R]}{n}\left(\frac{1-r}{r(1-P_0)}\right)^2 = \frac{N_0^2(2P_0 - r)}{n(1-P_0)}\left(\frac{1-r}{r}\right)^2. \tag{11}$$

Now consider the estimate that could be made if $2\lambda n$ tags were placed in the population,
where $\lambda^{-1}$ is the ratio of the cost of tagging to the cost of classification and hence may
be assumed to be greater than or equal to 1.

The random tag recoveries follow a hypergeometric distribution under the usual assump- 
tion of random sampling without replacement: this hypergeometric distribution is often 
approximated by a Poisson distribution (e.g. Chapman, 1948). Here, if $n$ is considered to
be small relative to $N$, while $R$ is not necessarily so, then it may be shown by the usual
procedures (using Stirling's formula for the large factorials appearing in the hypergeometric
formula) that the distribution tends in the limit, as $N \to \infty$, while $n/N \to 0$ and $R/N$
is bounded away from zero, to the binomial distribution with parameters $(n, R/N)$. If this
limiting distribution is used as an approximation to the true hypergeometric distribution,
the asymptotic variance of the estimate of $N_0$ based on the tag recoveries (say $\hat N_0(t)$) is

$$\sigma^2(\hat N_0(t)) = \frac{N_0^2(N_0 - R)}{2\lambda nR} = \frac{N_0^2(1-r)}{2\lambda nr}. \tag{12}$$


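Hence we may ask under what conditions on $\lambda$, $r$ and $P_0$ is $\sigma^2(\hat N_0) < \sigma^2(\hat N_0(t))$, i.e.

$$\frac{1}{2\lambda} > \left(\frac{2P_0 - r}{1 - P_0}\right)\left(\frac{1-r}{r}\right). \tag{13}$$

This inequality holds provided

$$P_0 < \frac{r + 2r\lambda(1-r)}{4\lambda(1-r) + r}. \tag{14}$$

Let

$$f(r, \lambda) = \frac{r + 2r\lambda(1-r)}{4\lambda(1-r) + r}. \tag{15}$$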

We observe that f(r,0) = 1, while f(r, 1) is a monotone function of r with f(0,1) = 0, 
f(1, 1) = 1. The first statement in this sentence is the trivial remark that the change of com- 
position method must always be better if tagging is prohibitively costly. The second case, 
where tagging is no more costly than classification, is of some interest; we consider this 
case in the next paragraph. 







Elementary considerations show that for $r < 1$, $P_0 < f(r, 1)$ implies $P_0 < r$. Since the $Y$
removal is zero, it is impossible for the proportion of the population removed to exceed the 
original proportion of X’s. Consequently, if tagging is no more costly than classification, 
from the large sample point of view, the tag sample procedure is under all circumstances 
more efficient than the change of composition estimation method. 

Turning to the case $\lambda^{-1} > 1$, we have used a different method of comparison. Tabled below
is the minimum value of $1/\lambda$ such that $\sigma^2(\hat N_0(t)) > \sigma^2(\hat N_0)$ for specified values of $P_0$ and $r$.


Table 3. Minimum value of relative cost of tagging to cost of
classification such that $\sigma^2(\hat N_0(t)) > \sigma^2(\hat N_0)$

                                          $R/N_0 = r$
$X_0/N_0 = P_0$    0.25      0.20      0.15      0.10      0.05      0.02       0.01
0.1                 -         -         -         -        6.34     19.60      41.80
0.2                 -         -        3.54      6.76     16.62     46.55      96.53
0.3                3.00      4.58      7.28     12.86     29.85     81.20     166.90
0.4                5.51      8.00     12.28     21.00     47.49    127.40     260.70
0.5                9.00     12.80     19.27     32.40     71.20    192.08     392.04
0.6               14.25     20.00     29.75     49.50    109.25    289.10     589.05
0.7               23.00     32.00     47.22     78.00    171.00    450.80     917.33
0.8               40.51     56.00     81.16    135.00    294.51    774.20    1574.10
0.9               93.00    128.00    187.00    306.00    332.51   1744.40    3544.20
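The comparison behind Table 3 can be sketched as follows. The function names are hypothetical; the two variance functions transcribe (11) and (12), and `min_cost_ratio` is the bound on $1/\lambda$ obtained by rearranging (13). It gives, for example, 3.00 at $P_0 = 0.3$, $r = 0.25$ and 9.00 at $P_0 = 0.5$, $r = 0.25$.

```python
def var_composition(pop_size, p0, r, n):
    """Asymptotic variance (11) of the change of composition estimate of N_0
    with n0 = n1 = n and removal from the X class only."""
    return pop_size ** 2 * (2 * p0 - r) / (n * (1 - p0)) * ((1 - r) / r) ** 2


def var_tagging(pop_size, r, n, lam):
    """Asymptotic variance (12) of the tag-recovery estimate when 2*lam*n
    tags are placed; 1/lam is the relative cost of tagging."""
    return pop_size ** 2 * (1 - r) / (2 * lam * n * r)


def min_cost_ratio(p0, r):
    """Value of 1/lam above which the change of composition estimate has the
    smaller asymptotic variance, from inequality (13)."""
    return 2 * (2 * p0 - r) * (1 - r) / ((1 - p0) * r)


p0, r = 0.3, 0.25
threshold = min_cost_ratio(p0, r)
print(threshold)
# At the threshold the two variances coincide (hypothetical N_0 = 1000, n = 100).
print(var_composition(1000, p0, r, 100), var_tagging(1000, r, 100, 1 / threshold))
```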
































Since the comparison of these two procedures is made under the conditions most favour- 
able to the change of composition technique, these results suggest that in almost all cases 
the capture-recapture estimation procedure will yield more information for the same 
amount of effort. However, it is to be noted that the capture-recapture estimates are
seriously dependent on the assumption that the initial capture does not alter the behaviour
pattern of the animals in any way that will affect their probability of recapture. This 
assumption is difficult to test and it is certainly not always reasonable. 


5. ESTIMATION WHEN $R_x$ AND $R_y$ MUST BE ESTIMATED


In the field, many additional complications may arise in the application of the change of 
composition estimation procedure. Often $R$ is known reasonably well, but $R_x$ and $R_y$ are
not; it may even happen that $R$ itself must be estimated. These complications do not change
the estimates but will modify the estimated variances and any interval estimates.

We study the case where $R$ is known but $R_x$ and $R_y$ must be estimated from a sample of
size $m$ from the removed group. We assume that a sample of size $m$ is taken randomly, and
that of the $m$ animals, $m_x$ are from class $X$, $m_y$ from $Y$. Then the estimates of $X_0$ and $N_0$,
denoted by $\tilde X_0$, $\tilde N_0$, are simply (1) and (2) with $R_x$ replaced by $\hat R_x$, where

$$\hat R_x = \frac{m_xR}{m}. \tag{16}$$













While the asymptotic variances of $\tilde X_0$ and $\tilde N_0$ may be determined by inverting a 3 x 3
information matrix, it is easier to compute them directly, for $\tilde X_0$ and $\tilde N_0$ are linear functions
of $\hat R_x$. Thus


Ae, XY, RLR Pi\1-P P,A-P, 











This is the asymptotic variance, asymptotic in the sense that terms of higher order in
$n_0^{-1}$, $n_1^{-1}$ are neglected.

A similar formula can be derived for $\sigma^2(\tilde X_0)$, and the same method can be followed through
in the case that $R$ itself has to be estimated in some manner. It is of course assumed that the
estimation of $R$ is independent of the random variables $x_0$, $x_1$. The assumption is implicit
in the development above.


6. COMBINATION OF CAPTURE-RECAPTURE AND CHANGE OF COMPOSITION ESTIMATES 


It was noted in § 4 that, if the experimenter is prepared to accept the assumptions underlying
the capture-recapture estimates, then he will gain more from putting his effort into
tagging animals rather than merely classifying them. Actually, however, he may try to do
both, for in this way he may get two different estimates, or the combined procedure may 
yield the most efficient single estimate. Moreover, the biologist will often wish to sample the 
population on more than one occasion, in order to study growth or other changes in the 
population. Consequently if the members of the initial sample are tagged as the basis of 
a tag-sample estimate, the normal pattern of the study of the population will make possible 
also the change of composition estimate (assuming of course a selective removal between 
the samples). 

In this situation the first problem that confronts the experimenter is to determine 
whether there is reasonable agreement between the two possible estimates. As noted above 
in § 4, the binomial model with parameters $(n_0, R/N_0)$ is a reasonable approximation to represent
the capture-recapture procedure. Since the large-sample variances are known for the
change of composition estimate and for the estimate based on the number of tags recovered 
(cf. formula (12)), a large-sample test is available to compare the two separate estimates. 

If the estimates are not compatible it may be due to failure in the assumptions in one 
or both models. These assumptions may be correct in regard to expectations, but the 
sources of variability may be greater than the models permit, i.e. the distributions are not 
binomial. This possibility will be considered in the next section. 

On the other hand, there may be sources of error, such as were discussed in § 2, or are 
well known for capture-recapture estimates. These have been particularly well pointed out 
by De Lury (1954). It is pertinent to note that the most important sources of error tend to 
bias the change of composition estimate downward, while the important factors of trap- 
shyness and tag mortality will tend to inflate the tag sample estimates upwards. 

If the estimates are compatible, there is a suggestion but certainly no proof that the 
assumptions of the two models are fulfilled. In this case it will be normal practice to com- 
bine them. They may be averaged, weighting them inversely to their asymptotic variances. 
However, for this simple model the maximum-likelihood equations are easily solved. We 
need to add to the notation of § 2 the following: 


$s$ = number of the $n_0$ animals taken in sample 0
which are recovered in the removal process.

It is assumed of course that these $n_0$ animals have been made identifiable by tagging or
marking in some manner. Then


reer AVEC OR as 


The maximum-likelihood equations for $X_0$ and $N_0$ are seen to be of the same form as those
of the simple model of § 2. They may be written in the form


ase. is 
Yo 2M Yy_M— Mots _ 
: ks." (29) 


and solved by analogy with those of the simple model. The maximum-likelihood estimators 
are 





X¢ = Re a, 21 
7 (1 — Ny +8) Xo — 22, ™ 

















Ng = 2nol (1 — My +8) R,— x, R] (22) 
(21 — Ny +8) Ly — 2Ny2, 
The information matrix is 
No Ny | ORG O SE 
Mele Aa43 N,¥, N.Y, 
; (23) 
No Ny Mo Xo mA, 4 
NY, NY,’ NY, NY, 
nok 
A= 24 
where A NEN, (24) 
. _ py “2 
Hence (Asymptotic) 72(N) = (4o—Fi) Net (25) 


Kolo, Mik, NBN, 


No Ny 


An analysis of (25) shows that the optimum allocation of $n_0$ and $n_1$ is made by choosing
$n_0$ the maximum possible, $n_1$ the minimum. This, of course, is necessarily true from the
results obtained in § 4, if the differential cost of tagging is disregarded. Consequently the 
combined procedure outlined here will be used if there is reason to question the assumptions 
of the tag-sample procedure, or if further sampling after the removal process will be made 
for other reasons. In this case the question of optimum allocation does not enter. 


7. EXTENSIONS FROM THE SIMPLE BINOMIAL MODEL 


There are several immediate extensions or generalizations of the model considered in § 2. 
For example, there may be several samples with one removal or several removals with one 
pair of samples, or several samples and several removals. The first two of these extensions 
are trivial; the several selective removals between two random samples may simply be 
combined by addition and the original model is obtained. If there are several samples either 
before or after a single removal, they may be combined in the usual manner that samples 
pertaining to the binomial model are combined. 
















Where there are several selective removals, each followed by a random sample, a new 
situation arises. This is of some interest, for the sampling process might be combined with 
the removal process to lead to such a situation, i.e. the sampler might only return the 
members of the Y group to the population (though in general it is to be expected that the 
experimenter will prefer to return both X’s and Y’s marked or tagged). 

The following additional notation is needed: 


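$R_{xi}$: removal of $X$ population prior to $i$th sample,
$R_{yi}$: removal of $Y$ population prior to $i$th sample,
$R_i = R_{xi} + R_{yi}$ ($i = 0, 1, 2, \ldots, k$).

Also $X_i = X_0 - R_{xi}$, $N_i = N_0 - R_i$, etc.
The maximum-likelihood equations for $X_0$ and $N_0$ may be written down simply as

$$\sum_{i=0}^{k}\frac{x_i}{X_i} = \sum_{i=0}^{k}\frac{n_i}{N_i}, \qquad \sum_{i=0}^{k}\frac{y_i}{Y_i} = \sum_{i=0}^{k}\frac{n_i}{N_i}. \tag{26}$$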

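These equations can be solved by iterative methods in any particular case. In case the $R_{xi}$, $R_{yi}$ are small relative to $X_0$, $Y_0$, the following approximate solution may be employed to yield estimates without the labour of the iterative procedure. The logarithm of the likelihood function may be written

$$\ln L = K + \sum_{i=0}^{k} x_i\ln\left(P_0 - \frac{R_{xi}}{N_0}\right) + \sum_{i=0}^{k} y_i\ln\left(1 - P_0 - \frac{R_{yi}}{N_0}\right) - \sum_{i=0}^{k} n_i\ln\left(1 - \frac{R_i}{N_0}\right)$$

$$\approx K + \sum_{i=0}^{k} x_i\ln P_0 - \sum_{i=0}^{k}\frac{x_iR_{xi}}{N_0P_0} + \sum_{i=0}^{k} y_i\ln(1 - P_0) - \sum_{i=0}^{k}\frac{y_iR_{yi}}{(1-P_0)N_0} + \sum_{i=0}^{k}\frac{n_iR_i}{N_0}, \tag{27}$$

so that the likelihood equations become

$$(1 - P_0)\sum_{i=0}^{k} x_iR_{xi} + P_0\sum_{i=0}^{k} y_iR_{yi} - P_0(1 - P_0)\sum_{i=0}^{k} n_iR_i = 0, \tag{28}$$

$$\frac{\sum_{i=0}^{k} x_i}{P_0} + \frac{\sum_{i=0}^{k} x_iR_{xi}}{N_0P_0^2} - \frac{\sum_{i=0}^{k} y_i}{1 - P_0} - \frac{\sum_{i=0}^{k} y_iR_{yi}}{N_0(1 - P_0)^2} = 0. \tag{29}$$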





$P_0$ may be determined from the quadratic equation (28) and this in turn substituted in (29).
The result is a simple linear equation for $N_0$.

The possibility mentioned earlier that the selective removal might be performed by the 
sampler suggests a further possibility—to set up a sequential procedure for sampling, 
removing the X group in the sample from the population and so on for k steps where k is 
determined by the actual observations. This would be analogous to the sequential procedure 
proposed by Goodman (1953) for the tag sample method of estimation.

Perhaps more important than any of these extensions is to consider the situation where 
sampling that conforms to the binomial model is impossible. Schooling and segregation by
size or sex or age will certainly be frequently encountered in many natural populations, so 
that factors beyond the control of the experimenter make invalid the assumptions of the 







binomial model. This will be true even when he is cognizant of the principles of random 
sampling. Since these additional factors tend to produce more variability from sample to 
sample than is ascribable to the binomial model, it is necessary for the experimenter to 
replicate his samples to obtain an internal measure of this variability.

Consider then several samples $n_{ij}$ ($i = 0, 1$; $j = 1, 2, \ldots, k_i$), each composed of $x_{ij}$ members of the $X$ class and $y_{ij}$ members of the $Y$ class. To estimate $X_0$, $N_0$ a regression technique may be used that has some resemblance to De Lury's procedure for population estimates from the catch per unit of effort data (De Lury, 1947). To do so it is necessary to postulate that the random variables $p_{ij} = x_{ij}/n_{ij}$ are approximately normal with mean $P_i$ and variance $\sigma_{ij}^2$.
Now define

$$Z_{0j} = X_0 - p_{0j}N_0, \qquad Z_{1j} = X_0 - R_x - p_{1j}(N_0 - R). \tag{30}$$

Then

$$E(Z_{ij}) = 0, \qquad \sigma^2(Z_{0j}) = N_0^2\sigma_{0j}^2, \qquad \sigma^2(Z_{1j}) = (N_0 - R)^2\sigma_{1j}^2. \tag{31}$$

If the underlying heterogeneity is considerable it is reasonable to assume that $\sigma_{ij}^2 = \sigma^2$ for all $i$, $j$; though if the $n_{ij}$ differ considerably some note should be taken of this, presumably following the ideas of Cochran (1942). Also for simplification the factors $N_0^2$, $(N_0 - R)^2$ in the variances of the $Z_{ij}$'s are neglected. The maximum-likelihood equations for $X_0$ and $N_0$ then reduce to a form almost identical with (1) and (2), provided the single observations there are replaced by means.


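Thus letting $\bar p_{i\cdot} = \sum_{j=1}^{k_i} p_{ij}/k_i$,

$$\hat N_0 = \frac{R_x - \bar p_{1\cdot}R}{\bar p_{0\cdot} - \bar p_{1\cdot}}, \tag{32}$$

while by routine methods it is found that

$$\text{(Asymptotic)}\quad \sigma^2(\hat N_0) = \frac{\sigma^2}{(P_0 - P_1)^2}\left[\frac{N_0^2}{k_0} + \frac{N_1^2}{k_1}\right]. \tag{33}$$

It should be pointed out that (33) is a large sample variance if $k_0$, $k_1$ are large, not $n_0$, $n_1$
as was the case with respect to formulas (3) and (4).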
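A minimal sketch of the replicated-sample estimate follows. It uses (32) for $\hat N_0$ and a plug-in version of (33); estimating the common $\sigma^2$ by the pooled within-occasion sample variance of the $p_{ij}$ is an assumption of the sketch, not something specified above, and the function name and data are hypothetical.

```python
from statistics import mean, variance


def replicated_estimate(p0_samples, p1_samples, r_x, r_y):
    """Estimate N_0 from replicated sample proportions before (p0_samples)
    and after (p1_samples) a known removal r_x, r_y.

    Returns (estimate of N_0, plug-in estimate of its large-sample variance).
    Each list must hold at least two proportions.
    """
    r = r_x + r_y
    k0, k1 = len(p0_samples), len(p1_samples)
    p0_bar, p1_bar = mean(p0_samples), mean(p1_samples)

    n0_hat = (r_x - p1_bar * r) / (p0_bar - p1_bar)   # formula (32)
    n1_hat = n0_hat - r

    # Pooled within-occasion variance as an estimate of sigma^2 (assumption).
    pooled_ss = variance(p0_samples) * (k0 - 1) + variance(p1_samples) * (k1 - 1)
    sigma2_hat = pooled_ss / (k0 + k1 - 2)

    # Plug-in version of formula (33).
    var_n0 = sigma2_hat / (p0_bar - p1_bar) ** 2 * (n0_hat ** 2 / k0
                                                    + n1_hat ** 2 / k1)
    return n0_hat, var_n0


# Hypothetical data: three replicate samples on each occasion, 200 removed from X.
n0_hat, var_n0 = replicated_estimate([0.38, 0.40, 0.42], [0.23, 0.25, 0.27],
                                     r_x=200, r_y=0)
print(n0_hat, var_n0 ** 0.5)
```

Here the precision is governed by the number of replicates $k_0$, $k_1$ and the between-sample scatter of the proportions, not by the individual sample sizes.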
Formula (33) may be used to estimate the magnitude of the sampling required for a
given removal, or the magnitude of the removal for given sampling, to achieve an estimate
with preassigned precision. Thus if $R_y = 0$, $k_0 = k_1 = k$ (say),

$$\text{(Asymptotic)}\quad \sigma^2(\hat N_0) = \frac{\sigma^2N_1^2(N_0^2 + N_1^2)}{kR^2(1 - P_0)^2}. \tag{34}$$

Unfortunately, it will often be true in advance that so little is known as to the possible
range of values for $\sigma^2$ that this formula cannot be used even to give qualitative results.

This procedure can be extended to the case of several selective removals, with a non-selective
sample taken between each removal. No new theory is involved although the
algebra becomes somewhat more tedious. However, such a process must necessarily take
place over time where mortality and possibly emigration and immigration should be 
considered. It is therefore thought desirable to defer consideration of this case to a later 
study, where the restriction that the population is closed, is removed. 

It is planned to illustrate the methods developed above in a further paper, using appro- 
priate data concerned with fish populations. 













REFERENCES 


CHAPMAN, D. G. (1948). A mathematical study of confidence limits of salmon populations calculated
from sample tag ratios. Bull. Int. Pacif. Salm. Fish. Comm. no. 2, pp. 67-85.

CHAPMAN, D. G. (1954). The estimation of biological populations. Ann. Math. Statist. 25, 1-15.

COCHRAN, W. G. (1942). Sampling theory when the units are of unequal sizes. J. Amer. Statist. Ass.
37, 199-212.

DE LURY, D. B. (1947). On the estimation of biological populations. Biometrics, 3, 145-67.

DE LURY, D. B. (1954). On the assumptions underlying estimates of mobile populations. Statistics
and Mathematics in Biology, pp. 287-93. Ames, Iowa: Iowa State College Press.

GOODMAN, LEO A. (1953). Sequential sampling tagging for population size problems. Ann. Math.
Statist. 24, 56-69.

SCATTERGOOD, L. W. (1954). Estimating fish and wildlife: a survey of methods. Statistics and
Mathematics in Biology, pp. 273-85. Ames, Iowa: Iowa State College Press.










AN AGE-DEPENDENT BIRTH AND DEATH PROCESS 


By W. A. O’N. WAUGH 
Royal Military College of Science, Shrivenham 


1. INTRODUCTION 


Many examples of the stochastic processes known as branching or population processes 
have been developed. The assumptions leading to the various models have been decided 
partly by mathematical convenience and partly by the attempt to reproduce features con- 
trolling the development of biological or physical populations. The model we shall develop 
involves a continuous time parameter and an enumerable set of states, and we can compare 
it with several other processes of this type, e.g. the Yule—Furry birth process, Feller’s 
birth and death process (see Feller, 1939, 1950), and the non-Markovian birth process due 
to D. G. Kendall (1948). 

There are two motives for the development of the present model. Mathematically, it 
furnishes a conveniently soluble example of a non-Markovian process, in which transitions 
from a given state may be to one of several other states, and in this it is a generalization of 
Kendall’s process in which transitions are always from the given state to its successor, 
i.e. from state number $n$ to state number $n + 1$. For purposes of application it is based on
assumptions which are not proposed as ideal but which bring in features which are perhaps 
a step in the direction of realism. Thus it may help in the development of yet more realistic 
models when they are required, and in indicating both the scope and the limitations of the 
models constructed under an assumption which has been widely adopted in population 
problems; viz. that the individuals reproduce independently of one another. 

Bellman & Harris (1948, 1952) have described a stochastic process similar to the one we 
shall examine, and we shall adopt their notation. They consider populations of particles 
which reproduce by fission independently of one another. They suppose that a ‘generation 
time’ is defined for these populations which is a random variable having a general distribu- 
tion function G(t) (which does not depend on the size of the population, in accordance with 
the assumption of independence just mentioned), so that dG(t) expresses the probability 
that a particle born at time $t = 0$ ends its life in the interval $(t, t+dt)$. 

They assume that with given probabilities $q_0, q_1, q_2, \dots$ any particle whose life has just 
ended is replaced by 0, 1, 2, ... new particles. 

Their detailed work is carried out for the pure birth process involving binary fission only: 


$$q_2 = 1, \qquad q_0 = q_1 = q_3 = \dots = 0.$$


They state that their results can easily be generalized to other modes of fission. Note that 
when $q_0 \neq 0$ the process is a birth and death process. Further developments in this direction 
have also been obtained by Ramakrishnan (1951) and by Reid (1953). 

It may be convenient to note here that a focal point of the present investigation will be 
the coefficient of variation of the population size, considered as a function of the time. 
The importance of this function in the study of populations has been emphasized previously, 
e.g. by Kendall (1952a). 










2. AGE-DEPENDENT PROCESS INVOLVING TWO POSSIBLE FATES: DEATH OR FISSION 


For clarity and because it leads to a model which may be useful in applications, we shall 
not adopt, in the main part of our investigation, the most general assumptions that are 
possible. We shall indicate at various points throughout the paper certain further general- 
izations that merely involve difficulties of technique and alterations in detail which might 
be worked out if applications seemed to demand it. 

We shall suppose that at the end of its life an individual either disappears or is replaced 
by two new individuals (i.e. death without issue or reproduction by binary fission). These 
two new individuals then develop and reproduce independently of one another in the same 
way as the parent individual, i.e. if the parent individual split at time $t$ to form them, then the 
probability that either of the two new individuals ends its life during $(t+\tau, t+\tau+d\tau)$ is 
$dG(\tau)$, the distribution $G(\tau)$ being fixed, and the same for all individuals of the population. 
In the notation of Bellman & Harris this would be equivalent to putting 

$$q_0 \neq 0, \qquad q_2 \neq 0, \qquad q_1 = q_3 = \dots = 0,$$

but we shall make the more general assumption that these probabilities depend on the age 
at which the parent individual’s life ends. A further generalization to admit the possibility 
of an individual’s facing more than two alternative fates is quite simple. We shall suppose 
that at time $t = 0$ the population consists of just one newly born individual. Let us name the 
alternatives which face an individual of the population; risk (b) is the event that its life ends 
with the birth of two new individuals, risk (d) is the event that it dies without issue. We shall 
derive the functions $q_n(t)$ and $G(t)$ as follows. Suppose that there are two functions $\lambda(t)$ and 
$\nu(t)$ such that, given that an individual born at time 0 is alive at time $t$, 

$$\Pr\{\text{individual succumbs to risk (b) in } (t, t+dt)\} = \lambda(t)\,dt + o(dt),$$
$$\Pr\{\text{individual succumbs to risk (d) in } (t, t+dt)\} = \nu(t)\,dt + o(dt),$$
$$\text{thus}\quad \Pr\{\text{individual's life ends in } (t, t+dt)\} = [\lambda(t) + \nu(t)]\,dt + o(dt).$$


The function $[\lambda(t) + \nu(t)]$ might be called the ‘force of mortality’. In the argument which 
follows, and which leads to the fundamental integral equation, it is convenient if $\lambda(t)$ and 
$\nu(t)$ are both continuous and do not vanish together at any point in $0 < t < \infty$ and are such that 
$\int_0^t [\lambda(u) + \nu(u)]\,du \to \infty$ as $t \to \infty$. In the latter part of the paper we assume a certain form for 
$\lambda(t)$ and $\nu(t)$, and it happens that these requirements of continuity, etc., are fulfilled. However, 
it is possible to extend the argument to cover cases in which certain discontinuities 
are allowed, in particular the case where an instantaneous chain reaction can occur at the 
moment of fission (a possibility which is mentioned by Bellman & Harris (1952)). In this 
case our assumption of age-dependence permits the probabilities governing the numbers 
of progeny at instantaneous fission to be different from those corresponding to fission after 
a finite life length. 
From $\lambda(t)$ and $\nu(t)$ we obtain the following conditional probabilities, given that the 
individual’s life ends in $(t, t+dt)$: 

$$\Pr\{\text{life ends by risk (b)}\} = q_2(t) = \frac{\lambda(t)}{\lambda(t) + \nu(t)},$$
$$\Pr\{\text{life ends by risk (d)}\} = q_0(t) = \frac{\nu(t)}{\lambda(t) + \nu(t)},$$

and also 

$$G(t) = 1 - \exp\left[-\int_0^t [\lambda(u) + \nu(u)]\,du\right]$$

= Pr {individual born at time 0 lives for a time less than or equal to $t$}. 
We can imagine an idealized population in which either the risk (d) is suspended and the 
population develops under risk (b) alone ($\nu(t) \equiv 0$) or vice versa. Then 

$$B(t) = 1 - \exp\left[-\int_0^t \lambda(u)\,du\right]$$

= Pr {individual born at time 0 lives a shorter time than $t$, when risk (d) is suspended}, 

$$D(t) = 1 - \exp\left[-\int_0^t \nu(u)\,du\right]$$

= Pr {individual born at time 0 lives a shorter time than $t$, when risk (b) is suspended}. 





If 
$$g(t) = \frac{dG(t)}{dt}, \qquad d(t) = \frac{dD(t)}{dt}, \qquad b(t) = \frac{dB(t)}{dt},$$
then 
$$q_0(t) = \frac{d(t)\,[1 - B(t)]}{g(t)},$$
and 
$$q_2(t) = \frac{b(t)\,[1 - D(t)]}{g(t)},$$
and also 
$$1 - G(t) = [1 - B(t)]\,[1 - D(t)].$$
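
As a short check on these relations (a sketch added for the reader, using only the definitions of $G$, $B$ and $D$ given in this section), the product form of $1 - G(t)$ and the expression for $q_0(t)$ follow directly from the exponential formulae:

```latex
\begin{aligned}
1 - G(t) &= \exp\Big[-\int_0^t \lambda(u)\,du\Big]\exp\Big[-\int_0^t \nu(u)\,du\Big]
          = [1 - B(t)]\,[1 - D(t)],\\
g(t) &= [\lambda(t) + \nu(t)]\,[1 - G(t)], \qquad
b(t) = \lambda(t)\,[1 - B(t)], \qquad
d(t) = \nu(t)\,[1 - D(t)],\\
\frac{d(t)\,[1 - B(t)]}{g(t)}
  &= \frac{\nu(t)\,[1 - D(t)]\,[1 - B(t)]}{[\lambda(t) + \nu(t)]\,[1 - G(t)]}
   = \frac{\nu(t)}{\lambda(t) + \nu(t)} = q_0(t).
\end{aligned}
```

The expression for $q_2(t)$ follows in the same way with the roles of $b$, $B$ and $d$, $D$ interchanged.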


3. THE INTEGRAL EQUATION FOR THE GENERATING FUNCTION OF THE POPULATION SIZE 


In order to obtain the mean and variance of the population size as functions of the time we 
start from an integral equation for the generating function of the distribution of the popula- 
tion size, which is very similar to the one from which Bellman & Harris (1952) derive their 
results. It is possible to give a definition of a suitable sample space $\Omega$ and a function $r_t(\omega)$ 
over $\Omega$ which may be called the population size at time $t$, and by considerations of measurability 
to show that $\{r_t : t \ge 0\}$ is a well-defined stochastic process, and furthermore to derive 
the fundamental integral equation from this definition. However, we will content ourselves 
with this reference to the more rigid approach and will merely give the following intuitive 
argument which leads to the integral equation. 

Let $p_r(t)$ be the probability that the population size is $r$ at time $t$, given that the population 
consisted of a single newly born individual at time 0. 

Let 
$$\phi(z, t) = \sum_{r=0}^{\infty} p_r(t)\,z^r,$$
and let 
$$h(z, t) = \sum_{n=0}^{\infty} q_n(t)\,z^n.$$


Then the situation at time t may have come about in either of two ways. 
(i) The initial individual may still be alive. 

(ii) The initial individual’s life may have ended in the interval $(u, u+du)$ where $0<u<t$. 
In this case, if its life ends in fate (b), then because of the independence of the members of 
the population and the fact that all individuals are assumed to develop under the same 
probabilities as regards life length and number of progeny, we have effectively two populations 
developing each for a period $t - u$, from initial ancestors newly born at time $u$. On the 
other hand, if its life ends in fate (d) then the population size at time $t$ is zero. 
Hence we have the integral equation 

$$\phi(z, t) = \int_0^t \left[q_0(u) + q_2(u)\,\phi^2(z, t-u)\right] g(u)\,du + z\,[1 - G(t)]$$
$$\phantom{\phi(z, t)} = \int_0^t h[\phi(z, t-u), u]\,g(u)\,du + z\,[1 - G(t)],$$


the latter form of which can easily be seen to hold when modes of fission other than binary 
are considered. 
In terms of the functions $B(t)$, $D(t)$ and their derivatives this can be written 

$$\phi(z, t) = \int_0^t [\phi(z, t-u)]^2\, b(u)\,[1 - D(u)]\,du + \int_0^t d(u)\,[1 - B(u)]\,du + z\,[1 - B(t)]\,[1 - D(t)].$$


If we differentiate this equation with respect to $z$ and then put $z = 1$ we get the following 
equation for the mean population size at time $t$, 

$$\mu_1(t) = 2\int_0^t \mu_1(t-u)\, b(u)\,[1 - D(u)]\,du + [1 - B(t)]\,[1 - D(t)],$$

while if we differentiate twice with respect to $z$ and then put $z = 1$ we get the equation 

$$\mu_2(t) = 2\int_0^t [\mu_1(t-u)]^2\, b(u)\,[1 - D(u)]\,du + 2\int_0^t \mu_2(t-u)\, b(u)\,[1 - D(u)]\,du,$$

for the second factorial moment of the population size. 


4. THE OBSERVABLE LIFE-LENGTH DISTRIBUTIONS AND THE PROBABILITY OF EXTINCTION 


In experimental observations we should probably be able to separate the data on life length 
into those for individuals ultimately suffering fate (b) and those for individuals ultimately 
suffering fate (d). These observable distributions will be related to the distributions B(t) 
and $D(t)$ in a similar manner to that in which crude rates are related to net rates in a problem 
of competing risks. 

Let $B^*(t)$ be the probability that an individual newly born at time 0 suffers fate (b) before 
time $t$. Then $B^*(\infty)$ is the probability that a newly born individual will ultimately suffer 
fate (b). 

Let $B^*(t \mid b)$ be the probability that an individual newly born at time 0 and known 
ultimately to suffer fate (b) has a life length less than or equal to $t$. 

Let $D^*(t)$, $D^*(\infty)$, $D^*(t \mid d)$ be the corresponding probabilities for fate (d). 

The probability that an individual born at time 0 suffers fate (b) in the interval $(t, t+dt)$ is 

$$[1 - G(t)]\,\lambda(t)\,dt + o(dt).$$

Hence 
$$B^*(t) = \int_0^t [1 - G(u)]\,\lambda(u)\,du,$$
and, since $B^*(t) = B^*(\infty)\,B^*(t \mid b)$, 

$$B^*(t \mid b) = \int_0^t [1 - G(u)]\,\lambda(u)\,du \Big/ \int_0^{\infty} [1 - G(u)]\,\lambda(u)\,du,$$

with a similar result for $D^*(t \mid d)$. It will be desirable in any attempt to fit the model to 
experimental data to give to $B^*(t \mid b)$ the form of the observed distribution of life lengths 
for those individuals whose life ultimately terminates in fate (b), and similarly for $D^*(t \mid d)$. 
We can express $B^*(t)$ in terms of $B(t)$ and $D(t)$ as follows: 

Since 
$$\lambda(t) = b(t)/[1 - B(t)]$$
and 
$$1 - G(t) = [1 - B(t)]\,[1 - D(t)],$$
therefore 
$$B^*(t) = \int_0^t [1 - D(u)]\,b(u)\,du,$$
and similarly 
$$D^*(t) = \int_0^t [1 - B(u)]\,d(u)\,du.$$


From the rigorous definition of $Z(t)$ (the population size) as a random variable $r_t(\omega)$ over 
a suitable sample space $\Omega$, to which we have referred, we can show that 

$$p_0(t) = \Pr\{Z(t) = 0\}$$

is a solution of the integral equation 

$$p_0(t) = \int_0^t \left\{q_0(u) + q_2(u)\,[p_0(t-u)]^2\right\} g(u)\,du.$$


Clearly $p_0(t)$ must be a bounded solution because we must have $0 \le p_0(t) \le 1$, and, because 
$\Pr\{Z(t_2) = 0\} \ge \Pr\{Z(t_1) = 0\}$ for $t_2 > t_1$, $p_0(t)$ is a monotonic increasing function of $t$. Furthermore, 
it can be shown, by a method due to Bellman & Harris, that the bounded solution of 
the integral equation is unique. 

Hence $p_0(t) \to P \le 1$ as $t \to \infty$. 

Furthermore, $p_0(t)$ is the sum of an indefinite integral and the convolution of a bounded 
monotonic increasing function with a continuous function. Hence $p_0(t)$ is continuous. 

Let $p_0(t) = P - \delta(t)$ so that $\delta(t) \downarrow 0$ as $t \to \infty$. 

The integral equation can be written 

$$P - \delta(t) = \int_0^t \{q_0(u) + P^2 q_2(u)\}\,g(u)\,du - 2P\int_0^t \delta(t-u)\,q_2(u)\,g(u)\,du + \int_0^t \delta^2(t-u)\,q_2(u)\,g(u)\,du.$$

Note that $\int_0^t q_2(u)\,g(u)\,du \le 1$, that $\delta(t) \le 1$, and that by Cauchy’s principle of convergence 
$\int_s^t q_2(u)\,g(u)\,du < \epsilon$, where $\epsilon$ is any positive constant, for any $t > s$ when $s$ is sufficiently large. 

Consider the second term in the equation above and divide the range of integration into two 
parts 

$$\begin{aligned}
\int_0^s \delta(t-u)\,q_2(u)\,g(u)\,du &+ \int_s^t \delta(t-u)\,q_2(u)\,g(u)\,du \\
&\le \int_0^s \delta(t-u)\,q_2(u)\,g(u)\,du + \int_s^t q_2(u)\,g(u)\,du \quad (\text{because } 0 \le \delta(t) \le 1),\\
&< \int_0^s \delta(t-u)\,q_2(u)\,g(u)\,du + \epsilon \quad \text{for sufficiently large } s,\\
&< \epsilon\int_0^s q_2(u)\,g(u)\,du + \epsilon \quad \text{for sufficiently large } t > s,\\
&\le 2\epsilon.
\end{aligned}$$




This argument and another similar one show that the last two terms on the right-hand side 
of the above integral equation tend to zero as $t \to \infty$, and so we get 

$$P = \int_0^{\infty} \{q_0(u) + P^2 q_2(u)\}\,g(u)\,du.$$

Let 
$$\rho = \int_0^{\infty} q_0(u)\,g(u)\,du \quad\text{and}\quad \sigma = \int_0^{\infty} q_2(u)\,g(u)\,du;$$

then, since by the assumption that $\int_0^t [\lambda(u) + \nu(u)]\,du \to \infty$ as $t \to \infty$ the function $G(t) \to 1$, 
we have 
$$P = \rho + \sigma P^2, \quad\text{where}\quad \rho + \sigma = 1.$$

The roots of this quadratic are $P = 1$, $P = \rho/\sigma$. If $\rho \ge \sigma$ we must therefore have $p_0(t) \to 1$ 
because the limit of $p_0(t)$ is not greater than 1. 

On the other hand, if $\rho < \sigma$, the limit of $p_0(t)$ might be either $\rho/\sigma$ or 1. 

Suppose the limit of $p_0(t)$ is 1. Then, for some $t_1$, $0 < t_1 < \infty$, we must have 

$$p_0(t_1) = \theta, \quad\text{where}\quad \rho/\sigma < \theta < 1.$$

Then, for this value of $t$, 

$$\theta = \int_0^{t_1} \left\{q_0(u) + q_2(u)\,[p_0(t_1 - u)]^2\right\} g(u)\,du \le \int_0^{t_1} \{q_0(u) + \theta^2 q_2(u)\}\,g(u)\,du.$$

Now if the upper limit of integration on the right-hand side is increased, the right-hand 
side cannot decrease. Furthermore, if $q_0(t)\,g(t)$ or $q_2(t)\,g(t)$ is non-zero for $t > t_1$, it must 
increase, if we increase the range of integration to $(0, \infty)$. 

In this case 
$$\theta < \rho + \sigma\theta^2, \quad\text{i.e.}\quad \sigma\theta^2 - \theta + \rho > 0.$$

But $\theta$ lies between the roots $\rho/\sigma$ and 1 of the left-hand side, which is a quadratic expression 
that must be negative inside the interval $(\rho/\sigma, 1)$. Hence we have a contradiction, and 
therefore $p_0(t)$ cannot take any value in the interval $(\rho/\sigma, 1)$. 

On the other hand, if $q_0(t)\,g(t)$ and $q_2(t)\,g(t)$ vanish identically for $t > t_1$, we may have the 
equality $\theta = \rho + \sigma\theta^2$, 
but this implies $\theta = \rho/\sigma$ or $\theta = 1$, which again contradicts the assumption $\rho/\sigma < \theta < 1$. 

Hence we see that $p_0(t) \to \min\{\rho/\sigma, 1\}$ and this limit is the probability of extinction. 
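
The result of this section can be illustrated numerically. The following is a minimal sketch, not part of the original argument: the function name `extinction_from_hazards`, the truncation point `t_max` and the trapezoidal rule are all assumptions made for illustration. It evaluates $\sigma = \int_0^\infty [1 - G(u)]\,\lambda(u)\,du$ for hazard functions supplied by the user, takes $\rho = 1 - \sigma$ (which presumes $G(t) \to 1$ and a negligible truncation error), and returns $\min\{\rho/\sigma, 1\}$.

```python
import math


def extinction_from_hazards(lam, nu, t_max=60.0, steps=60000):
    """Return (sigma, rho, P) for hazard functions lam(t) and nu(t).

    sigma is the probability of ultimately suffering fate (b),
    rho = 1 - sigma, and P = min(rho / sigma, 1).
    Assumes the integrand is negligible beyond t_max.
    """
    h = t_max / steps
    cum_hazard = 0.0                 # integral of lam + nu over (0, t)
    sigma = 0.0                      # integral of [1 - G(u)] lam(u) du
    prev_total = lam(0.0) + nu(0.0)
    prev_integrand = lam(0.0)        # survival probability at age 0 is 1
    for i in range(1, steps + 1):
        t = i * h
        total = lam(t) + nu(t)
        cum_hazard += 0.5 * h * (prev_total + total)
        integrand = math.exp(-cum_hazard) * lam(t)
        sigma += 0.5 * h * (prev_integrand + integrand)
        prev_total, prev_integrand = total, integrand
    rho = 1.0 - sigma
    if sigma <= rho:
        return sigma, rho, 1.0
    return sigma, rho, rho / sigma


# Constant hazards (the Markovian case): lam = 2, nu = 1.
sigma, rho, p_ext = extinction_from_hazards(lambda t: 2.0, lambda t: 1.0)
print(sigma, rho, p_ext)
```

For constant hazards the values can be compared with the closed forms $\sigma = \lambda/(\lambda+\nu)$ and $\rho/\sigma = \nu/\lambda$.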


5. EXAMPLE: THE SIMPLEST MARKOVIAN BIRTH AND DEATH PROCESS 


To illustrate the methods of the preceding sections we shall obtain the integral equation, 
etc., for the birth and death process in which the probabilities are as follows: 
For a particle alive at time $t$ 

probability to split into 2 new particles during $(t, t+dt)$ = $\lambda\,dt + o(dt)$, 
probability to die during $(t, t+dt)$ = $\nu\,dt + o(dt)$, 

where $\lambda$ and $\nu$ are constants. 





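In this case $b(t) = \lambda e^{-\lambda t}$ and $d(t) = \nu e^{-\nu t}$. Hence we obtain 
$$g(t) = (\lambda + \nu)\,e^{-(\lambda+\nu)t},$$
$$q_0(t) = \nu/(\lambda + \nu), \qquad q_2(t) = \lambda/(\lambda + \nu),$$
and the integral equation is 
$$\phi(z, t) = \int_0^t \left[\nu + \lambda\,\phi^2(z, t-u)\right] e^{-(\lambda+\nu)u}\,du + z\,e^{-(\lambda+\nu)t}.$$

For the observable life-length distributions we have 

$$B^*(t \mid b) = \int_0^t \lambda e^{-(\lambda+\nu)u}\,du \Big/ \int_0^{\infty} \lambda e^{-(\lambda+\nu)u}\,du = 1 - e^{-(\lambda+\nu)t} = G(t),$$

and the same result holds for $D^*(t \mid d)$. For the extinction probabilities, since 
$$\sigma = B^*(\infty) = \lambda/(\lambda + \nu) \quad\text{and}\quad \rho = D^*(\infty) = \nu/(\lambda + \nu),$$

the probability of extinction is equal to 1 if $\lambda \le \nu$ and is equal to $\nu/\lambda$ otherwise. These 
particular results are of course all well known. 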
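
The value $\nu/\lambda$ can also be reached by iterating the quadratic $P = \rho + \sigma P^2$ from $P = 0$. This is a sketch of a standard fixed-point argument and is not the method used in the text; the function name `extinction_by_iteration` is hypothetical. Successive iterates may be read as the probabilities that the line of descent is extinct within 1, 2, ... generations, and they increase to the smaller root of the quadratic.

```python
def extinction_by_iteration(lam, nu, iterations=500):
    """Iterate P <- rho + sigma * P**2 starting from P = 0.

    Uses the Markovian values sigma = lam / (lam + nu) and
    rho = nu / (lam + nu). Convergence is slow when lam == nu.
    """
    sigma = lam / (lam + nu)
    rho = nu / (lam + nu)
    p = 0.0
    for _ in range(iterations):
        p = rho + sigma * p * p
    return p


print(extinction_by_iteration(2.0, 1.0))  # compare with nu / lam
print(extinction_by_iteration(1.0, 2.0))  # compare with 1
```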


6. CHOICE OF A PARTICULAR FORM FOR b(t) AND d(t) 


Although the functions that appear in the basic statement of the birth and death process 
are $\lambda(t)$ and $\nu(t)$, it is a clearer way of indicating the type of age dependence we are considering 
if we give first the frequency functions of the life-length distributions. In the 
ensuing sections of this paper we shall take 

$$b(t) = \frac{(k\lambda)^k\, t^{k-1} e^{-k\lambda t}}{(k-1)!}, \qquad d(t) = \frac{(m\nu)^m\, t^{m-1} e^{-m\nu t}}{(m-1)!},$$

where $\lambda$ and $\nu$ are positive constants. 

Probability distributions of this type have been used before in population problems (see 
Kendall, 1952) and also (much earlier) in the theory of queues (for an account of this see 
Erlang, 1948). A ‘microscopic’ interpretation of them has been given in the birth process, 
where only one fate is possible for each member of the population, i.e. in the binary fission 
process with no death effect. This interpretation is that at the birth of a new particle a process 
is started within it, which must go through a fixed number, $k$, of successive distinct phases. 
The times spent in the successive phases are independent random variables, each having 
distribution function $1 - e^{-k\lambda t}$. When the last phase is completed the particle subdivides and 
forms two new particles. It follows that the ‘generation time’ is distributed like $\chi^2_{2k}/(2k\lambda)$. 
Note that in giving this interpretation we do not intend to imply that if a real population 
is found to have a generation time with this distribution then a process of $k$ phases is 
necessarily going on within the individuals. Our reason for adopting this form of distribution 
of generation time is that it appears to give a fairly good fit in certain populations (see, for 
example, Kendall (1948) and recent experimental results by Dr E. O. Powell (1955)), and a 
real physical multiphase process may or may not be held to account for this. 




We can extend this multi-phase interpretation to our model if we consider that at the 
birth of a new particle two such processes start within it, one being of $k$ phases and the other 
of $m$ phases. We suppose that the fate of the particle is decided as follows: if the $k$-phase 
process (which we might call ‘maturity’) is completed first, the particle undergoes fate (b) 
at the moment of completion; if the $m$-phase process (which we might call ‘decay’) is completed 
first, the particle undergoes fate (d). 

An important property of the above distributions is that as the ‘number of phases’ 
$k$ (or $m$) tends to infinity the distributions tend towards the corresponding deterministic 
distributions, while for all $k$ (or $m$) they have the fixed mean value $1/\lambda$ (or $1/\nu$). 
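
The two-process interpretation lends itself to a simple simulation. The sketch below is illustrative only; the function name `simulate_fission_fraction`, the sample size and the seed are assumptions. It draws the completion times of the ‘maturity’ and ‘decay’ processes as independent gamma variables with shape $k$ (or $m$) and mean $1/\lambda$ (or $1/\nu$), and records the fraction of particles whose maturity process finishes first, i.e. the fraction suffering fate (b).

```python
import random


def simulate_fission_fraction(k, m, lam, nu, n=20000, seed=1):
    """Estimate the probability that a particle ends its life by fission.

    The maturity time is a sum of k exponential phases, each with rate
    k * lam; the decay time is a sum of m phases, each with rate m * nu.
    random.gammavariate takes (shape, scale).
    """
    rng = random.Random(seed)
    births = 0
    for _ in range(n):
        maturity = rng.gammavariate(k, 1.0 / (k * lam))
        decay = rng.gammavariate(m, 1.0 / (m * nu))
        if maturity < decay:
            births += 1
    return births / n


# With k = m and lam = nu the two fates are equally likely by symmetry.
print(simulate_fission_fraction(3, 3, 1.0, 1.0))
```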

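It follows from the above choice of $b(t)$ and $d(t)$ that 

$$B(t) = 1 - e^{-k\lambda t}\left[1 + k\lambda t + \dots + \frac{(k\lambda t)^{k-1}}{(k-1)!}\right],$$
$$D(t) = 1 - e^{-m\nu t}\left[1 + m\nu t + \dots + \frac{(m\nu t)^{m-1}}{(m-1)!}\right],$$
$$\lambda(t) = \frac{k\lambda\,\dfrac{(k\lambda t)^{k-1}}{(k-1)!}}{1 + k\lambda t + \dots + \dfrac{(k\lambda t)^{k-1}}{(k-1)!}},$$
$$\nu(t) = \frac{m\nu\,\dfrac{(m\nu t)^{m-1}}{(m-1)!}}{1 + m\nu t + \dots + \dfrac{(m\nu t)^{m-1}}{(m-1)!}}.$$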

Making use of properties of the above forms of B(t) and D(t) we shall express each integral 
equation in terms of convolutions of functions, and then reduce it to a linear differential 
equation with constant coefficients. The subsequent solution depends on solving the 
characteristic equation explicitly, and hence the equation for the mean population size. 

In the next two sections we shall suppose that k> 1, but that m = 1. Roughly speaking 
this assumption means that births occur in an age-dependent manner but deaths occur at 
random. Because a complete solution is available and because the assumption may well be 
realistic in some applications, we give the explicit solution for the mean and derive the 
variance and coefficient of variation of the population size from it. We can compare our 
results (i) with Kendall’s (1948) uniform multi-phase pure birth process by putting $\nu = 0$, 
and (ii) with Feller’s Markovian birth and death process by putting k = m = 1. This leads, 
of course, to the simple form of this process in which the birth and death rates do not depend 
on the population size. 

When the number of phases in both $B(t)$ and $D(t)$ exceeds 1 an explicit solution of the 
characteristic equation is not readily obtainable, so we give asymptotic formulae for the 
mean and variance of the population size in terms of the dominant root of the characteristic 
equation. 

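We shall find the following notation convenient. 

Let 
$$(\alpha) = \alpha e^{-\alpha t}.$$

Let 
$$\int_0^t f(t-u)\,\alpha e^{-\alpha u}\,du = f(t) * (\alpha).$$

Thus $\alpha^2 t e^{-\alpha t} = (\alpha) * (\alpha)$. We shall write $(\alpha) * (\alpha) = (\alpha)^{*2}$, and, extending this analogy 
with powers, 

$$\frac{\alpha^n t^{n-1} e^{-\alpha t}}{(n-1)!} = (\alpha) * (\alpha) * \dots * (\alpha) = (\alpha)^{*n}.$$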





Let 
$$\int_0^t \frac{\alpha^n u^{n-1} e^{-\alpha u}}{(n-1)!}\,du = 1 * (\alpha)^{*n}$$
be the distribution function of $\tfrac{1}{2}\chi^2_{2n}/\alpha$, where 1 is the function that is constant and equal to 
unity for all $t$ in $-\infty < t < \infty$. 

Let $k\lambda + m\nu = \gamma$. 


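In this notation we can express the functions which occur in the integral equations as 
follows: 

$$\begin{aligned}
b(t)\,[1 - D(t)] &= \frac{(k\lambda)^k\, t^{k-1} e^{-k\lambda t}}{(k-1)!}\; e^{-m\nu t}\left[1 + m\nu t + \dots + \frac{(m\nu t)^{m-1}}{(m-1)!}\right] \\
&= (k\lambda)^k \left[\frac{1}{\gamma^k}\,\frac{\gamma^k t^{k-1} e^{-\gamma t}}{(k-1)!} + \frac{m\nu\,k}{\gamma^{k+1}}\,\frac{\gamma^{k+1} t^{k} e^{-\gamma t}}{k!} + \dots \right. \\
&\qquad\qquad \left. + \frac{(m\nu)^{m-1}}{(m-1)!}\,\frac{k(k+1)\dots(k+m-2)}{\gamma^{k+m-1}}\,\frac{\gamma^{k+m-1} t^{k+m-2} e^{-\gamma t}}{(k+m-2)!}\right] \\
&= \left(\frac{k\lambda}{\gamma}\right)^{k} \left[(\gamma)^{*k} + \frac{m\nu}{\gamma}\,k\,(\gamma)^{*(k+1)} + \dots + \left(\frac{m\nu}{\gamma}\right)^{m-1} \frac{k(k+1)\dots(k+m-2)}{(m-1)!}\,(\gamma)^{*(k+m-1)}\right].
\end{aligned}$$

Similarly 

$$d(t)\,[1 - B(t)] = \left(\frac{m\nu}{\gamma}\right)^{m} \left[(\gamma)^{*m} + \frac{k\lambda}{\gamma}\,m\,(\gamma)^{*(m+1)} + \dots + \left(\frac{k\lambda}{\gamma}\right)^{k-1} \frac{m(m+1)\dots(m+k-2)}{(k-1)!}\,(\gamma)^{*(m+k-1)}\right].$$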

The observable life-length distributions corresponding to probabilities of the type we 
have adopted are given by 

$$B^*(t) = \left(\frac{k\lambda}{\gamma}\right)^{k} \left[1 * (\gamma)^{*k} + \frac{m\nu}{\gamma}\,k\; 1 * (\gamma)^{*(k+1)} + \dots + \left(\frac{m\nu}{\gamma}\right)^{m-1} \frac{k(k+1)\dots(k+m-2)}{(m-1)!}\; 1 * (\gamma)^{*(k+m-1)}\right],$$

whence $B^*(\infty)$ follows on noting that $1 * (\gamma)^{*(k+j)}$ tends to 1 as $t \to \infty$ for $j = 0, 1, \dots, m-1$. 
Hence we can see that the distribution function $B^*(t \mid b)$, and similarly $D^*(t \mid d)$, is the sum 
of constant multiples of several distribution functions each of $\chi^2$ form, i.e. it is a mixture of 
$\chi^2$ variables (see Robbins, 1948). 

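The probability of extinction is equal to 1 if 

$$\left(\frac{k\lambda}{\gamma}\right)^{k} \left[1 + \frac{m\nu}{\gamma}\,k + \dots + \left(\frac{m\nu}{\gamma}\right)^{m-1} \frac{k(k+1)\dots(k+m-2)}{(m-1)!}\right] \le \frac{1}{2},$$

that is, if $B^*(\infty) = \sigma \le \tfrac{1}{2}$, which is the condition $\rho \ge \sigma$ of § 4. 

Note that when $k = m$ this is equivalent to $\nu \ge \lambda$. 

The expression above consists of the first $m$ terms of the negative binomial distribution 
whose generating function is $p^k(1 - qs)^{-k}$ when we put $p = k\lambda/\gamma$ and $q = m\nu/\gamma$, and thus the 
given condition is equivalent to the statement that the median of this negative binomial 
distribution shall be greater than or equal to $m$. 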
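
The negative binomial form makes $B^*(\infty)$ easy to evaluate. The following is a minimal sketch under the assumptions of this section; the function names `fission_probability` and `extinction_probability_erlang` are hypothetical. It sums the first $m$ negative binomial terms with $p = k\lambda/\gamma$, $q = m\nu/\gamma$ and then applies the result $\min\{\rho/\sigma, 1\}$ of § 4 with $\sigma = B^*(\infty)$ and $\rho = 1 - \sigma$.

```python
import math


def fission_probability(k, m, lam, nu):
    """B*(infinity): first m terms of the negative binomial distribution."""
    gamma = k * lam + m * nu
    p = k * lam / gamma
    q = m * nu / gamma
    # math.comb(k + j - 1, j) equals k(k+1)...(k+j-1) / j!
    return sum(math.comb(k + j - 1, j) * p**k * q**j for j in range(m))


def extinction_probability_erlang(k, m, lam, nu):
    """min(rho / sigma, 1) with sigma = B*(infinity) and rho = 1 - sigma."""
    sigma = fission_probability(k, m, lam, nu)
    rho = 1.0 - sigma
    return min(rho / sigma, 1.0)


print(fission_probability(3, 3, 1.0, 1.0))            # k = m, lam = nu
print(extinction_probability_erlang(1, 1, 2.0, 1.0))  # Markovian case
```

When $k = m$ and $\lambda = \nu$ the two fates are symmetric, so the first call should return one half; the second call reproduces the Markovian value $\nu/\lambda$.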










7. THE CASE OF RANDOM DEATH: MEAN POPULATION SIZE 


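In this case, putting $m = 1$, we have 

$$b(t)\,[1 - D(t)] = \left(\frac{k\lambda}{\gamma}\right)^{k} (\gamma)^{*k},$$

$$d(t)\,[1 - B(t)] = \frac{\nu}{\gamma}\left[(\gamma) + \frac{k\lambda}{\gamma}\,(\gamma)^{*2} + \dots + \left(\frac{k\lambda}{\gamma}\right)^{k-1} (\gamma)^{*k}\right].$$

The corresponding observable life-length distributions are obtained as follows: 

$$B^*(t) = \left(\frac{k\lambda}{\gamma}\right)^{k} 1 * (\gamma)^{*k}, \quad\text{whence}\quad B^*(\infty) = \left(\frac{k\lambda}{\gamma}\right)^{k},$$

and so 
$$B^*(t \mid b) = 1 * (\gamma)^{*k},$$

which is of just the same form as $B(t)$ with the parameter $\gamma = k\lambda + \nu$ replacing $k\lambda$. The effect 
of deaths occurring at random, on the observable birth rate, is merely to alter the constant 
in the distribution without altering the form of the distribution whereby the dependence 
of the birth-rate on the age of the parent is expressed. 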


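$$D^*(t) = \frac{\nu}{\gamma}\left[1 * (\gamma) + \frac{k\lambda}{\gamma}\; 1 * (\gamma)^{*2} + \dots + \left(\frac{k\lambda}{\gamma}\right)^{k-1} 1 * (\gamma)^{*k}\right],$$

whence 
$$D^*(\infty) = \frac{\nu}{\gamma}\left[1 + \frac{k\lambda}{\gamma} + \dots + \left(\frac{k\lambda}{\gamma}\right)^{k-1}\right] = 1 - \left(\frac{k\lambda}{\gamma}\right)^{k},$$

and so 
$$D^*(t \mid d) = \frac{\nu}{\gamma}\left[1 - \left(\frac{k\lambda}{\gamma}\right)^{k}\right]^{-1} \left[1 * (\gamma) + \frac{k\lambda}{\gamma}\; 1 * (\gamma)^{*2} + \dots + \left(\frac{k\lambda}{\gamma}\right)^{k-1} 1 * (\gamma)^{*k}\right].$$

Thus, as might be expected, although the actual death-rate is independent of the age of the 
individuals the effect of the age-dependent birth-rate is to produce an observable death-rate 
which is age-dependent. 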

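The probability of extinction is equal to 1 if $(k\lambda/\gamma)^k \le \tfrac{1}{2}$, i.e. if $\nu \ge k(2^{1/k} - 1)\lambda$, while if 
$\nu < k(2^{1/k} - 1)\lambda$, then 

$$\text{Probability of extinction} = \left[1 - \left(\frac{k\lambda}{\gamma}\right)^{k}\right] \Big/ \left(\frac{k\lambda}{\gamma}\right)^{k}.$$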


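In the case of random death the integral equation for the mean population size may be 
written in the ‘convolution’ notation as 

$$\mu_1(t) = 2\left(\frac{k\lambda}{\gamma}\right)^{k} \mu_1(t) * (\gamma)^{*k} + 1 - \left(\frac{k\lambda}{\gamma}\right)^{k} 1 * (\gamma)^{*k} - \frac{\nu}{\gamma}\; 1 * \left[(\gamma) + \frac{k\lambda}{\gamma}\,(\gamma)^{*2} + \dots + \left(\frac{k\lambda}{\gamma}\right)^{k-1} (\gamma)^{*k}\right],$$

where we have put $1 - G(t) = 1 - 1 * g(t)$. 
Let us define a differential operator, 
$$\delta = 1 + \frac{1}{\gamma}\frac{d}{dt},$$
so that $\delta[f(t) * (\gamma)] = f(t)$. 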
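
The property of the operator $\delta$ stated above can be verified in one line; the following derivation is added as a sketch and uses only the definition of the convolution $f(t) * (\gamma)$ given in § 6, with $F(t)$ an auxiliary symbol introduced here.

```latex
\begin{aligned}
F(t) &\equiv f(t) * (\gamma) = \int_0^t f(s)\,\gamma e^{-\gamma (t-s)}\,ds,\\
\frac{dF}{dt} &= \gamma f(t) - \gamma \int_0^t f(s)\,\gamma e^{-\gamma (t-s)}\,ds = \gamma f(t) - \gamma F(t),\\
\delta F(t) &= F(t) + \frac{1}{\gamma}\frac{dF}{dt} = F(t) + f(t) - F(t) = f(t).
\end{aligned}
```

Each application of $\delta$ therefore removes one factor $(\gamma)$ from a convolution power $(\gamma)^{*n}$, which is what reduces the integral equation to a differential equation.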






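Put $t = 0$ in the integral equation and in the relations resulting from applying the 
operators $\delta, \delta^2, \dots, \delta^{k-1}$ to it. This gives 

$$\mu_1(0) = 1, \qquad \delta\mu_1(0) = \frac{k\lambda}{\gamma}, \qquad \dots, \qquad \delta^{k-1}\mu_1(0) = \left(\frac{k\lambda}{\gamma}\right)^{k-1},$$

whence it follows that 

$$\left(\frac{d}{dt}\right)^{j} \mu_1(t)\,\Big|_{t=0} = (-\nu)^j \qquad (j = 0, 1, \dots, k-1).$$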





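While $\delta^k$ gives the differential equation of the mean 

$$\delta^k \mu_1(t) = 2\left(\frac{k\lambda}{\gamma}\right)^{k} \mu_1(t) + 1 - \left(\frac{k\lambda}{\gamma}\right)^{k} - \frac{\nu}{\gamma}\left[1 + \frac{k\lambda}{\gamma} + \dots + \left(\frac{k\lambda}{\gamma}\right)^{k-1}\right].$$

The constant terms sum to zero so we get 

$$\left[\delta^k - 2\left(\frac{k\lambda}{\gamma}\right)^{k}\right] \mu_1(t) = 0.$$

The characteristic equation for the above differential equation is 

$$\left(1 + \frac{x}{\gamma}\right)^{k} - 2\left(\frac{k\lambda}{\gamma}\right)^{k} = 0.$$

If $\omega$ is the primitive $k$th root of unity this gives 
$$x = 2^{1/k} k\lambda\,\omega^j - \gamma, \quad\text{where}\quad j = 0, 1, \dots, k-1.$$
These roots lie on a circle in the complex plane, centre $-\gamma$, radius $2^{1/k} k\lambda$. The root of largest 
real part is $x = 2^{1/k} k\lambda - \gamma = k(2^{1/k} - 1)\lambda - \nu$. 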


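To obtain an explicit solution for the mean population size it is convenient to consider the 
mean population size which results when there is no death effect ($\nu = 0$). In this case the 
system reduces to the uniform multi-phase birth process described by Kendall (1948, 1952b). 
If we call the mean population size $\bar{\mu}_1(t)$ when $\nu = 0$ we see that it satisfies the equation 

$$\left[\left(1 + \frac{1}{k\lambda}\frac{d}{dt}\right)^{k} - 2\right] \bar{\mu}_1(t) = 0$$

together with the end conditions 

$$\bar{\mu}_1(0) = 1 \quad\text{and}\quad \left(\frac{d}{dt}\right)^{j} \bar{\mu}_1(t)\,\Big|_{t=0} = 0 \quad\text{for}\quad j = 1, 2, \dots, k-1.$$

The solution of this system is 

$$\begin{aligned}
\bar{\mu}_1(t) &= \frac{1}{2k} \sum_{j=0}^{k-1} \frac{2^{1/k}\omega^j}{2^{1/k}\omega^j - 1}\, \exp\left[k(2^{1/k}\omega^j - 1)\lambda t\right] \\
&= e^{-k\lambda t} \sum_{r=0}^{\infty} 2^r \left[\frac{(k\lambda t)^{rk}}{(rk)!} + \frac{(k\lambda t)^{rk+1}}{(rk+1)!} + \dots + \frac{(k\lambda t)^{rk+k-1}}{(rk+k-1)!}\right].
\end{aligned}$$

It follows that if we put 
$$\mu_1(t) = e^{-\nu t}\,\bar{\mu}_1(t),$$
then $\mu_1(t)$ will satisfy the differential equation, together with the end-conditions, that we 
have derived for it above. 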
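
The two forms of the solution for $\nu = 0$ can be compared numerically. The following is a minimal sketch, with hypothetical function names and an assumed truncation of the infinite series, which evaluates the mean both from the sum over the $k$th roots of unity and from the power series, and then applies the factor $e^{-\nu t}$.

```python
import cmath
import math


def mean_no_death_roots(k, lam, t):
    """Mean for nu = 0 from the sum over the k-th roots of unity."""
    a = 2 ** (1.0 / k)
    total = 0j
    for j in range(k):
        z = a * cmath.exp(2j * cmath.pi * j / k)   # 2^(1/k) * omega^j
        total += z / (z - 1) * cmath.exp(k * (z - 1) * lam * t)
    return (total / (2 * k)).real


def mean_no_death_series(k, lam, t, terms=400):
    """Mean for nu = 0 from the power series, truncated after `terms` terms."""
    x = k * lam * t
    total, term = 0.0, 1.0            # term holds x**n / n!
    for n in range(terms):
        total += 2 ** (n // k) * term
        term *= x / (n + 1)
    return math.exp(-x) * total


def mean_population(k, lam, nu, t):
    """Mean population size with random death: exp(-nu t) times the above."""
    return math.exp(-nu * t) * mean_no_death_roots(k, lam, t)


print(mean_no_death_roots(3, 0.7, 2.0), mean_no_death_series(3, 0.7, 2.0))
```

For $k = 1$ both forms reduce to $e^{\lambda t}$, the mean of the simple birth process.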

The value of the mean population size for large $t$ depends on whether the dominant root 
of the characteristic equation, i.e. the real root $(2^{1/k} - 1)k\lambda - \nu$, is less than, equal to, or 
greater than 0. We shall consider these cases in turn. 

(i) $\nu > k(2^{1/k} - 1)\lambda$. The mean population size tends to zero like $e^{-\alpha t}$, where $\alpha$ is a positive 
constant, and it follows from this or from the value of $B^*(\infty)$ that the probability of 
extinction is 1. 

(ii) $\nu = k(2^{1/k} - 1)\lambda$. The probability of extinction is again equal to 1 but we obtain a 
finite limiting mean population size 

$$\mu_1(t) \to C_1, \quad\text{where}\quad C_1 = \frac{2^{1/k}}{2k(2^{1/k} - 1)}.$$

(iii) $\nu < k(2^{1/k} - 1)\lambda$. The probability of extinction is equal to $\left(1 + \dfrac{\nu}{k\lambda}\right)^{k} - 1 < 1$. The mean 
population size will increase exponentially and we can write the following asymptotic 
expression for it: 

$$\mu_1(t) \sim \frac{2^{1/k}}{2k(2^{1/k} - 1)} \exp\left[\{(2^{1/k} - 1)k\lambda - \nu\}\,t\right].$$


8. THE CASE OF RANDOM DEATH: COEFFICIENT OF VARIATION OF THE POPULATION SIZE 


Using the convolution notation we may express the integral equation connecting the mean 
with the second factorial moment of the population size as follows: 

$$\mu_2(t) = 2\left(\frac{k\lambda}{\gamma}\right)^{k} \{\mu_1(t)\}^2 * (\gamma)^{*k} + 2\left(\frac{k\lambda}{\gamma}\right)^{k} \mu_2(t) * (\gamma)^{*k}.$$

By applying the differential operator $\delta$ as in the case of the equation for the mean this 
reduces to 

$$\left[\left(1 + \frac{1}{\gamma}\frac{d}{dt}\right)^{k} - 2\left(\frac{k\lambda}{\gamma}\right)^{k}\right] \mu_2(t) = 2\left(\frac{k\lambda}{\gamma}\right)^{k} \{\mu_1(t)\}^2,$$

i.e. multiplying through by $\gamma^k$ and factorizing the operator 

$$\prod_{r=0}^{k-1} \left[\frac{d}{dt} - k(2^{1/k}\omega^r - 1)\lambda + \nu\right] \mu_2(t) = 2(k\lambda)^k \{\mu_1(t)\}^2.$$

As with the mean, the behaviour of the second factorial moment as $t \to \infty$ depends on whether 
the dominant root of the characteristic equation is negative, zero, or positive, and we shall 
examine these cases in turn. 
(i) $\nu > k(2^{1/k} - 1)\lambda$. All terms in the solution for the second factorial moment will tend 
to zero faster than $e^{-\alpha t}$, where $\alpha$ is some positive constant. 
(ii) $\nu = k(2^{1/k} - 1)\lambda$. The term $\{\mu_1(t)\}^2$ on the right-hand side of the equation tends to 
the finite limit $C_1^2$ as $t \to \infty$, while the equation is 

$$\frac{d}{dt} \prod_{r=1}^{k-1} \left\{\frac{d}{dt} - k(2^{1/k}\omega^r - 1)\lambda + \nu\right\} \mu_2(t) = 2(k\lambda)^k \{\mu_1(t)\}^2.$$

Hence for large $t$ we shall have an asymptotic solution: 

$$\mu_2(t) \sim \frac{2(k\lambda)^k\, C_1^2\, t}{\prod_{r=1}^{k-1} \{\nu - k(2^{1/k}\omega^r - 1)\lambda\}} = \frac{2(k\lambda)^k\, C_1^2\, t}{k(k\lambda + \nu)^{k-1}}.$$

Hence the variance tends to infinity, but does so to the order of $t$ rather than $e^t$ and in this 
the behaviour of this age-dependent system resembles that of the simple Markovian birth 
and death process when the parameters $\lambda$ and $\nu$ in it are equal. 

Note that as $k \to \infty$ we have the condition $\nu = \lambda \log_e 2 \approx 0.69315\lambda$ for a finite limiting mean 
population size. 

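(iii) $\nu < k(2^{1/k} - 1)\lambda$. The dominant term in the complementary function will be of the 
order of $\exp[\{k(2^{1/k} - 1)\lambda - \nu\}\,t]$, but the particular integral will contain a term in 
$\exp[\{k(2^{1/k} - 1)\lambda - \nu\}\,2t]$ which will therefore give the asymptotic value of the complete 
solution. Formally we can write the particular integral as 

$$\mu_2(t) = \frac{2(k\lambda)^k}{\left(\gamma + \dfrac{d}{dt}\right)^{k} - 2(k\lambda)^k}\, \{\mu_1(t)\}^2,$$

and hence 

$$\mu_2(t) \sim \frac{2(k\lambda)^k}{\{k(2^{1+1/k} - 1)\lambda - \nu\}^k - 2(k\lambda)^k}\; C_1^2 \exp\left[\{k(2^{1/k} - 1)\lambda - \nu\}\,2t\right].$$

Let $V_n(t)$ be the coefficient of variation of the population size. Then 

$$V_n^2(t) = \frac{\mu_2(t)}{\{\mu_1(t)\}^2} + \frac{1}{\mu_1(t)} - 1,$$

and if we introduce the asymptotic values for the mean and second factorial moment, we 
obtain for the coefficient of variation for large $t$ 

$$V_n^2 = \lim_{t \to \infty} V_n^2(t) = \frac{2}{\left\{1 + \dfrac{1}{k}\left(2a_k - \dfrac{\nu}{\lambda}\right)\right\}^{k} - 2} - 1,$$

where $a_k = k(2^{1/k} - 1)$. 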

When $k \to \infty$, the process tends to resemble one in which deaths occur at random but 
fission occurs deterministically when the parent individual attains a certain fixed age equal 
to $1/\lambda$. The limit as $k \to \infty$ of the above coefficient of variation, which is itself a limit for large 
$t$, is thus of interest. Noting that $a_k \to \log_e 2$ and that $\tfrac{1}{2} < e^{-\nu/\lambda} \le 1$, with equality only when 
$\nu = 0$, we obtain the limit 

$$\lim_{k \to \infty} V_n^2 = \frac{1 - e^{-\nu/\lambda}}{e^{-\nu/\lambda} - \tfrac{1}{2}}.$$

If $V_g$ is the coefficient of variation of the distribution $B(t)$, then $V_g = 1/\sqrt{k}$ and so $V_n/V_g = V_n\sqrt{k}$ 
will have a finite limit only if $V_n \to 0$ as $k \to \infty$, i.e. if $\nu = 0$. This point will be of significance 


if any attempt is made to estimate V, from observations of V,, the latter being easier to 
observe, for instance, in bacterial populations. 
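
The limiting coefficient of variation for the case of random death is simple to tabulate. The sketch below is illustrative; the function names are hypothetical and the formulae are those of this section, valid only when $\nu < k(2^{1/k} - 1)\lambda$. It evaluates the squared limiting coefficient of variation for finite $k$ and its limit as $k \to \infty$.

```python
import math


def cv2_limit_random_death(k, lam, nu):
    """Limiting squared coefficient of variation for m = 1 and finite k."""
    a_k = k * (2 ** (1.0 / k) - 1)
    if nu >= a_k * lam:
        raise ValueError("requires nu < k (2^(1/k) - 1) lam")
    # 2^(1 + 1/k) - 1 - nu/(k lam) equals 1 + (2 a_k - nu/lam) / k
    base = 2 ** (1 + 1.0 / k) - 1 - nu / (k * lam)
    return 2.0 / (base ** k - 2.0) - 1.0


def cv2_limit_large_k(lam, nu):
    """Limit of the above as k tends to infinity."""
    x = math.exp(-nu / lam)
    if x <= 0.5:
        raise ValueError("requires nu < lam * log(2)")
    return (1.0 - x) / (x - 0.5)


for k in (1, 2, 5, 20, 100):
    print(k, cv2_limit_random_death(k, 1.0, 0.3))
print("limit", cv2_limit_large_k(1.0, 0.3))
```

For $k = 1$ the first function reduces to $(\lambda + \nu)/(\lambda - \nu)$, the familiar value for the simple Markovian birth and death process.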













9. BIRTH- AND DEATH-RATES BOTH AGE-DEPENDENT 


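We have seen that when $k > 1$ and $m > 1$ the instantaneous probabilities of death and of 
fission both depend on the age of the individual considered. We will first investigate the 
behaviour of the mean population size under these conditions. The integral equation for 
the mean population size is 

$$\mu_1(t) = 2\left(\frac{k\lambda}{\gamma}\right)^{k} \mu_1(t) * \left[(\gamma)^{*k} + \frac{m\nu}{\gamma}\,k\,(\gamma)^{*(k+1)} + \dots + \left(\frac{m\nu}{\gamma}\right)^{m-1} \frac{k(k+1)\dots(k+m-2)}{(m-1)!}\,(\gamma)^{*(k+m-1)}\right] + [1 - B(t)]\,[1 - D(t)].$$

Now 
$$[1 - B(t)]\,[1 - D(t)] = e^{-\gamma t} P(t),$$
where $P(t)$ is a polynomial of degree $k + m - 2$ in $t$. Hence 
$$\delta^{k+m-1}\,[1 - B(t)]\,[1 - D(t)] = \left(1 + \frac{1}{\gamma}\frac{d}{dt}\right)^{k+m-1} e^{-\gamma t} P(t) = 0.$$

Thus, if we apply the operator $\delta^{k+m-1}$ to the above integral equation we obtain the following 
homogeneous differential equation for the mean population size 

$$F\!\left(\frac{d}{dt}\right)\mu_1(t) \equiv \left[\delta^{k+m-1} - 2\left(\frac{k\lambda}{\gamma}\right)^{k} \left\{\delta^{m-1} + \frac{m\nu}{\gamma}\,k\,\delta^{m-2} + \dots + \left(\frac{m\nu}{\gamma}\right)^{m-1} \frac{k(k+1)\dots(k+m-2)}{(m-1)!}\right\}\right] \mu_1(t) = 0, \qquad \delta = 1 + \frac{1}{\gamma}\frac{d}{dt}.$$

The characteristic equation is $F(x) = 0$ and if we put $y = 1 + x/\gamma$ this is 

$$H(y) \equiv y^{k+m-1} - 2\left(\frac{k\lambda}{\gamma}\right)^{k} \left[y^{m-1} + \frac{m\nu}{\gamma}\,k\,y^{m-2} + \dots + \left(\frac{m\nu}{\gamma}\right)^{m-1} \frac{k(k+1)\dots(k+m-2)}{(m-1)!}\right] = 0.$$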

It does not seem as easy in this case as it was in the case of random death ($m = 1$) to obtain an 
explicit solution of the characteristic equation. We will obtain an asymptotic solution for 
the mean population size, and will consider only the case when the probability of extinction 
is less than 1. The condition for the probability of extinction to be less than 1 is $\rho < \sigma$ and 
this is the same as $1 - 2B^*(\infty) < 0$, i.e. $H(1) < 0$. Since $H(y) \to \infty$ as $y \to \infty$ it follows that the 
characteristic equation $H(y) = 0$ must have at least one real root greater than 1. There is 
just one change of sign in $H(y)$ so by Descartes’ rule of signs it follows that $H(y) = 0$ has 
a unique real positive root, say $w$, which is simple and greater than 1. 

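We can show that $w$ is larger than the real part of any other root of $H(y) = 0$. Let $z = u + iv$ 
be any complex root. We can ignore real roots and complex roots with $u \le 0$. We have, 
rearranging $H(z) = 0$, 

$$2\left(\frac{k\lambda}{\gamma}\right)^{k} \left\{\frac{1}{z^{k}} + \frac{m\nu}{\gamma}\,k\,\frac{1}{z^{k+1}} + \dots + \left(\frac{m\nu}{\gamma}\right)^{m-1} \frac{k(k+1)\dots(k+m-2)}{(m-1)!}\,\frac{1}{z^{k+m-1}}\right\} = 1,$$

whence, taking moduli, 

$$2\left(\frac{k\lambda}{\gamma}\right)^{k} \left\{\frac{1}{|z|^{k}} + \frac{m\nu}{\gamma}\,k\,\frac{1}{|z|^{k+1}} + \dots + \left(\frac{m\nu}{\gamma}\right)^{m-1} \frac{k(k+1)\dots(k+m-2)}{(m-1)!}\,\frac{1}{|z|^{k+m-1}}\right\} \ge 1.$$

Now if $z$ is complex, $v \neq 0$, and so $|z| > u$. The left-hand side, considered as a polynomial 
in $|z|^{-1}$, is strictly increasing for any argument greater than 0, because all coefficients are 
positive, so if we substitute $u^{-1}$ for $|z|^{-1}$ we can drop the sign of equality. Thus 

$$\begin{aligned}
2\left(\frac{k\lambda}{\gamma}\right)^{k} &\left\{\frac{1}{u^{k}} + \frac{m\nu}{\gamma}\,k\,\frac{1}{u^{k+1}} + \dots + \left(\frac{m\nu}{\gamma}\right)^{m-1} \frac{k(k+1)\dots(k+m-2)}{(m-1)!}\,\frac{1}{u^{k+m-1}}\right\} > 1 \\
&= 2\left(\frac{k\lambda}{\gamma}\right)^{k} \left\{\frac{1}{w^{k}} + \frac{m\nu}{\gamma}\,k\,\frac{1}{w^{k+1}} + \dots + \left(\frac{m\nu}{\gamma}\right)^{m-1} \frac{k(k+1)\dots(k+m-2)}{(m-1)!}\,\frac{1}{w^{k+m-1}}\right\},
\end{aligned}$$

whence $w > u$. 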

It follows that $F(x) = 0$ has a root $r$ which is unique, simple, real and positive and larger 
than the real part of any other root of $F(x) = 0$. Hence we have, as the asymptotic expression 
for the mean population size 

$$\mu_1(t) \sim c\,e^{rt},$$
where $c$ is a constant which could in theory be evaluated, though to do so we should require 
the other roots of the characteristic equation, and the appropriate end-conditions. However, 
it is not essential to evaluate this constant, because it will disappear from the limiting 
form of the coefficient of variation of the population size. 


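To write down the differential equation for the second factorial moment of the population 
size we note that 

$$\left(1 + \frac{x}{\gamma}\right)^{k+m-1} - F(x) = 2\left(\frac{k\lambda}{\gamma}\right)^{k} \left\{\left(1 + \frac{x}{\gamma}\right)^{m-1} + \frac{m\nu}{\gamma}\,k \left(1 + \frac{x}{\gamma}\right)^{m-2} + \dots + \left(\frac{m\nu}{\gamma}\right)^{m-1} \frac{k(k+1)\dots(k+m-2)}{(m-1)!}\right\}.$$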
The integral equation for the second factorial moment is similar to the one which arose 
in the case of random death, but it contains more terms. 


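It can be reduced to a differential equation in just the same way, by applying the operator 
$\delta = \left(1 + \dfrac{1}{\gamma}\dfrac{d}{dt}\right)$, and the result is 

$$F\!\left(\frac{d}{dt}\right)\mu_2(t) = \left[\left(1 + \frac{1}{\gamma}\frac{d}{dt}\right)^{k+m-1} - F\!\left(\frac{d}{dt}\right)\right] \{\mu_1(t)\}^2.$$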





We can see from the solution of the equation for the mean that the complementary function 
for this equation will have a dominant term of the order of e”, where r is the same dominant 
root of the characteristic equation. However, owing to the appearance of {w,(t)}* on the 
right-hand side we see that the particular integral will contain a term of the order of e?” 
which will be the dominant term in the complete solution. Evaluating this term we obtain 
D k+m-1 _ D 

adt)~ (1+ eae F(2r) 
Note that the denominator $F(2r) \ne 0$.

It follows that the coefficient of variation of the population size for large $t$ is given by





c? e2rt, 


,; (1+ 2r/y)eem- 
72 2(¢) = ates 
Vn = lim Val!) = “—“Fer) 
In conclusion I should like to thank Mr D. G. Kendall for his continued help and 
encouragement throughout the preparation of this paper. 













REFERENCES 


BELLMAN, R. & HARRIS, T. E. (1948). On the theory of age-dependent stochastic branching processes.
Proc. Nat. Acad. Sci., Wash., 34, 601-4.

BELLMAN, R. & HARRIS, T. E. (1952). On age-dependent binary branching processes. Ann. Math.
55, 280-95.

ERLANG, A. K. (1948). The Life and Works of A. K. Erlang. Edited by E. Brockmeyer, H. L. Halstrøm
and Arne Jensen. Copenhagen: Trans. Danish Acad. Tech. Sciences, 1948, No. 2.

FELLER, W. (1939). Die Grundlagen der Volterraschen Theorie des Kampfes ums Dasein in
wahrscheinlichkeitstheoretischer Behandlung. Acta biotheor., Leiden, 5, 11-40.

FELLER, W. (1950). An Introduction to Probability Theory and its Applications. New York: John
Wiley and Sons, Inc.

HARRIS, T. E. (1951). Some mathematical models for branching processes. Proc. 2nd Berkeley
Symposium on Mathematical Statistics and Probability, pp. 305-28.

KENDALL, D. G. (1948). On the role of variable generation time in the development of a stochastic
birth process. Biometrika, 35, 316-30.

KENDALL, D. G. (1952a). On the choice of a mathematical model to represent normal bacterial growth.
J. R. Statist. Soc. B, 14, 41-4.

KENDALL, D. G. (1952b). Les processus stochastiques de croissance en biologie. Ann. Inst. Poincaré,
13, 43-108.

POWELL, E. O. (1955). Some features of the generation times of individual bacteria. Biometrika,
42, 16-44.

RAMAKRISHNAN, ALLADI (1951). Some simple stochastic processes. J. R. Statist. Soc. B, 13, 131-40.

REID, A. T. (1953). An age-dependent stochastic model of population growth. Bull. Math. Biophys.
15, 361-5.

ROBBINS, H. (1948). Mixture of distributions. Ann. Math. Statist. 19, 360-69.













A LARGE-SAMPLE BIOASSAY DESIGN WITH RANDOM DOSES 
AND UNCERTAIN CONCENTRATION* 


By FRED C. ANDREWS AND HERMAN CHERNOFF
The University of Nebraska and Stanford University 


1. THE PROBLEM 


The problem discussed here is that of designing an experiment for the purpose of estimating 
a parameter in a dose-response curve when the doses administered cannot be known with 
exactness. 

To introduce the problem let us consider the following extreme example. A biologist has 
developed a strain of bacteria. He believes that this strain is so virulent that a dose of one 
organism applied to a test animal will lead to a response with a probability of the order of 
magnitude of 0.2. He wishes to estimate the virulence of this strain of bacteria more
precisely. For experimentation he has available thirty test animals and 10 ml. of material 
containing this strain of bacteria in suspension. The concentration of bacterial organisms in 
this material is about four organisms per millilitre, but this latter number might be un- 
reliable. The immediate problem confronting the biologist is one of design. Since he is not 
sure of the concentration he must use a portion of his material for a plate count (Clifton, 
1950, pp. 243—4) to estimate the concentration. Then he must allocate portions of his 
remaining material among the test animals to determine virulence. The amount of material 
to be used in the plate count cannot be too large, for then the remainder will not be an 
adequate amount of dose material. Some balancing allocation between these two parts of 
the experiment must be reached or the entire experiment will not yield its maximum 
amount of information. 

For the sake of the reader who prefers to peek at the last page of a mystery story, the 
optimal design corresponding to the biologist’s problem (when completely formulated) is 
the following. To determine concentration 3.1 ml. of material are used. The remaining
6.9 ml. are divided equally among the thirty test animals.

According to this design about twenty-eight organisms are divided among thirty animals. 
It is obvious that not all of the animals are assured of receiving organisms. In fact, the 
dosage is random, and some animals might receive none, some might receive one and still 
others might receive two or more organisms. This points out that in our problem exact doses 
are not known for two reasons. First, the concentration is at best an estimate based on the 
limited amount of material assayed. Secondly, even if the concentration was known, the 
exact dose administered to an animal is a random variable which the biologist cannot 
observe directly. This second situation will be referred to as dosage subject to error. Dosage 
subject to error occurs frequently in practice but is seldom treated in theory (see Haley, 
1953), although such theory is especially relevant when the doses are small. 

The problem facing the biologist can be formulated briefly by stating that the fractions
$f_0, f_1, f_2, \ldots, f_{30}$ must be determined with the understanding that $f_0$ is the fraction of the


* This work was sponsored by the U.S. Army, Navy and Air Force through the Joint Services 
Advisory Committee for Research Groups in Applied Mathematics and Statistics. 













material to be used in the plate count and $f_i$ the part to be used to dose the $i$th test animal.
Of course these fractions are to be determined in such a manner that an ‘optimal’ design 
for estimating the dose response curve will result. 


2. MODEL


In order to formulate completely this design problem it is necessary to specify carefully 
the nature of the probability distributions involved. 

First, the dose-response curve must be specified. In this paper, we shall treat only the 
case where our biologist has reason to believe that if d organisms are administered to a test 
animal the probability of no response is given by 


$$P_d = (1-\alpha)^d. \qquad (1)$$
The parameter $\alpha$ represents the probability that a dose of one organism will lead to a positive
response. This dose-response curve is said to be exponential because the dose $d$ appears in
the exponent. This formula may also be written

$$P_d = e^{d \log_e (1-\alpha)}. \qquad (2)$$

This model has previously been discussed in the literature by Goldberg & Watkins (1952),
Druett (1952) and Peto (1953), and in a different context by Cochran (1950).

Secondly, we must consider the distribution of the number of organisms appearing in a
certain amount of the material. Frequently the law of small numbers is assumed in this
situation (see Worcester, 1954), yielding: If the available material is taken from a source
where the concentration is $c$ organisms per millilitre, the number of organisms $x$ appearing in a
sample of $r$ millilitres is a Poisson random variable with mean $cr$. That is,

$$P(x=i) = e^{-cr}\,\frac{(cr)^i}{i!} \quad (i = 0, 1, 2, \ldots). \qquad (3)$$

Furthermore, if several samples are taken from the source (without replacement) the numbers of
organisms in these samples are independently distributed.

As an example, suppose that one-half of the tube of 10 ml. were used in the plate count.
Suppose also that the concentration of the source was exactly 4 organisms per millilitre.
Then the number of organisms in the part plated would be a Poisson random variable with
mean (expectation) $\tfrac12 \cdot 10 \cdot 4 = 20$. When the dose $d$ is a Poisson random variable with
expectation $D$, we shall say that $D$ is the nominal dose. Then

$$P(d=i) = e^{-D}\,\frac{D^i}{i!} \quad (i = 0, 1, 2, \ldots). \qquad (4)$$

We might repeat here that in our problem the biologist does not know the concentration 

and must use the plate count to estimate the concentration of the source or equivalently the 

expected number of organisms in the test tube. Let A be a notation for the expected number 

of organisms in the test-tube. If then the biologist uses a fraction f of his material as a dose, 
the nominal dose is fA. 


3. THE EFFECT OF ERROR IN DOSE 


As was pointed out in the introduction, the exact number of organisms in a dose is a random 
variable. Ordinarily, as in the case with the probit or logit models, the randomness of the
dose very seriously complicates the mathematical situation, as is quite evident from reading







Haley (1953). It is, however, fortunate that with the use of the exponential model this 
complication is minimized. 

Suppose that a nominal dose $D$ is applied; then the probability of a negative response is
given by
$$P_D = \sum_{i=0}^{\infty} P(d=i)\,(1-\alpha)^i = e^{-\alpha D}.$$

Contrast this with the probability of a negative response when the exact dose is $d$, namely,
$$P_d = (1-\alpha)^d = e^{[\log_e (1-\alpha)]\,d}.$$

It is fundamental here to see that the response curve for a nominal dose is also exponential
in form. The only difference in the two curves is that $\log_e (1-\alpha)$ in the exact dose curve is
replaced by $-\alpha$ in the nominal dose curve. (For small $\alpha$, $\log_e (1-\alpha)$ is almost equal to $-\alpha$.)
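
The identity between the Poisson-weighted sum and $e^{-\alpha D}$ can be checked numerically. The following is a minimal sketch, not part of the original paper: the function names are hypothetical, and the values $\alpha = 0.2$ and $D = 0.92$ are illustrative only (roughly twenty-eight organisms shared among thirty animals).

```python
import math

def exact_dose_no_response(alpha, d):
    """Probability of no response when exactly d organisms are given: (1 - alpha)^d."""
    return (1.0 - alpha) ** d

def nominal_dose_no_response(alpha, D, terms=200):
    """Average (1 - alpha)^i over a Poisson(D) number of organisms i.

    The Poisson probabilities are built up recursively to avoid large factorials.
    The truncation at `terms` is an assumption that is harmless for small D.
    """
    total = 0.0
    pmf = math.exp(-D)  # P(d = 0)
    for i in range(terms):
        total += pmf * exact_dose_no_response(alpha, i)
        pmf *= D / (i + 1)  # P(d = i + 1) from P(d = i)
    return total

alpha, D = 0.2, 0.92          # illustrative values only
summed = nominal_dose_no_response(alpha, D)
closed_form = math.exp(-alpha * D)
print(summed, closed_form)
```

The two printed numbers agree to rounding error, which is the sense in which the nominal-dose curve is again exponential.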


4. THE NATURE OF AN ASYMPTOTICALLY LOCALLY OPTIMAL DESIGN. AN EXAMPLE


The problem we have posed in § 1 yields a solution which might be termed a large-sample 
locally optimal design. To illustrate what we mean by these terms, we shall use a simpler 
problem which has already been treated. 

Suppose that it is desired to dose a large number of test animals with organisms from a
material with known concentration and that there is an unlimited supply of organisms
available. This problem reduces then to selecting nominal doses $D_1, D_2, D_3, \ldots$ to administer
to the animals. Once such nominal doses are selected and observations obtained, one may
apply maximum-likelihood or some other asymptotically efficient estimation technique.
It is known (though not trivial) how to apply the maximum-likelihood method of estimation
and how to obtain the asymptotic variance of the resulting estimate of $\alpha$. By an asymptotically
optimal design we mean one which minimizes the asymptotic variance. It turns out
for the exponential model that the asymptotic variance is minimized when all the doses
are the same and these are equal to

$$D = 1.6/\alpha.$$

Here the local character of the problem becomes evident. That is, the asymptotically
optimal design to estimate $\alpha$ depends on $\alpha$. But it is clear that if $\alpha$ were known, there would
be no point to the experiment. This paradoxical situation is resolved by the following
consideration. In practice, the biologist will have a rough idea of the order of magnitude
of $\alpha$. Suppose he felt he could guess $\alpha$ to be of the order of magnitude of 0.2; then he could use the
dose 1.6/0.2 in place of $1.6/\alpha$. It can easily be shown that the use of the wrong dose level
increases the asymptotic variance of the estimate of $\alpha$. However, this increase is very
gradual. If we measure the dose as a fraction of the asymptotically optimal dose
and the corresponding variance similarly we obtain the following table:


Dose/optimal dose | Variance/optimal variance
0.50 | 1.25
0.75 | 1.02
1.00 | 1.00
1.25 | 1.02
1.50 | 1.12
















It is clear that if the biologist has any reasonable educated guess about $\alpha$ he could use the
design which would be optimal if his guess were the correct value of $\alpha$. If, however, his best
guess might be off by a factor of three or more, alternative procedures would have to be
considered. For example, he might use a small preliminary experiment involving several
dose levels which are very highly spread out. Using the estimate of $\alpha$ based on this
preliminary experiment he could then apply the locally optimal design.

The gradual increase in variance divided by optimal variance shows not only the applic- 
ability of the locally optimal design. It can also be used to show that designs where dose/ 
optimal dose takes on several values from 0-75 to 1-25 are almost optimal. While such 
designs lead to more complexity in the calculation of maximum-likelihood estimates they 
are useful to biologists who have reservations about the applicability of the exponential 
model to their problem. 

5. RESULTS 


As may have been expected, the large-sample optimal solution to our design problem in- 
volving an assay is a local solution. In fact, it will be shown in the appendix that the solution 
has the following characteristics: 

(1) Every test animal is given an equal fraction of the test-tube. According to the dictates 
of the mathematical model there is no advantage in limiting the number of animals used to 
allow larger individual doses. Of course costs of obtaining individual animals would fore- 
stall the use of an unlimited number of animals in any one experiment. 

(2) The fraction of the test-tube administered to test animals is given by $\beta^0$ which is a
function of $\alpha$ (probability that one organism will cause a positive reaction), $A$ (the expected
number of organisms in the test tube), and $s$ (the number of test animals available).

This functional dependence is rather complicated, but $\beta^0$ is very well approximated by
$\beta^*$ which is obtained from the following simple formula:

$$\beta^0 \approx \beta^* = \operatorname{minimum}\left(\frac{1}{1+\sqrt{\alpha}},\ \frac{1.6\,s}{\alpha A}\right). \qquad (5)$$
Hence if the numbers given in the statement of the problem were relevant we would have
$$A \approx 40,\quad \alpha \approx 0.2,\quad s = 30,\quad \frac{1}{1+\sqrt{\alpha}} \approx 0.69,\quad \frac{1.6\,s}{\alpha A} \approx 6,\quad \beta^* \approx 0.69.$$

A remark about how $\beta^*$ was obtained might be pertinent here. If a very large number of
organisms were available it would be desirable to give each animal a dose of about $1.6/\alpha$
and to use the large amount remaining for the plate count. Then the fraction of the test-tube
used for doses would be about $1.6s/(\alpha A)$. However, if there are very few organisms available,
it turns out that $\sqrt{\alpha}/(1+\sqrt{\alpha})$ is approximately the right fraction to use in plate counting,
leaving the fraction $1/(1+\sqrt{\alpha})$ for doses.

(3) The local nature of the solution is not a serious handicap to applications.

As in the example treated in § 4, the asymptotic variance undergoes a relatively small
change even when the fraction devoted to doses is changed by a considerable amount. This
can be seen by referring to Figs. 1 and 2 and Table 1. Table 1 gives $\beta^0$ as a function of $\alpha$
and $A$ for $s = 30$. In addition, there is presented $\underline{\beta}$, $\bar{\beta}$ and $\sigma^2(\hat\alpha)$, where $\underline{\beta} < \beta^0 < \bar{\beta}$.

$\underline{\beta}$ and $\bar{\beta}$ are fractions devoted to doses which yield designs which are 80% efficient
(efficiency measured in terms of asymptotic variance). $\sigma^2(\hat\alpha)$ is the asymptotic variance
using $\beta^0$. The relatively large spread between $\underline{\beta}$ and $\bar{\beta}$ is an indication of the wide
applicability of the local solution.




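In Fig. 1, $\underline{\beta}$, $\beta^0$, $\beta^*$ and $\bar{\beta}$ are drawn as functions of $A$ for $\alpha = 0.10$ and $s = 30$. Similarly,
in Fig. 2, $\underline{\beta}$, $\beta^0$, $\beta^*$ and $\bar{\beta}$ are obtained for $\alpha = 0.04$. These figures indicate the goodness of
the approximation of $\beta^*$ to $\beta^0$.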

[Fig. 1. Comparison of $\beta$ values as a function of $A$. Curves: upper 80% efficiency limit, optimal value, lower 80% efficiency limit. Vertical axis: $\beta$, 0 to 1.0. Horizontal axis: $A$, from 1000 to 64000, doubling at each tick.]

[Fig. 2. Comparison of $\beta$ values as a function of $A$. Curves: upper 80% efficiency limit, $\beta^*$ approximation, optimal value, lower 80% efficiency limit. Vertical axis: $\beta$. Horizontal axis: $A$, from 100 to 6400, doubling at each tick.]


(4) The solution provides a very simple computational method of estimating $\alpha$.
If all the test animals are given the same dose $D$, the proportion $\hat p$ of test animals which
do not react is the maximum-likelihood estimate of $e^{-\alpha D}$. Hence, the estimate of $\alpha$ is given by

$$\hat\alpha = -\frac{1}{D}\log_e \hat p = -\frac{2.303}{D}\log_{10} \hat p.$$

















































Table 1. Various $\beta$ values and the minimum asymptotic variance of $\hat\alpha$ for the case of $s = 30$ test animals

Key: $\underline{\beta}$, lower 80% efficiency limit; $\beta^0$, optimal value of $\beta$; $\bar{\beta}$, upper 80% efficiency limit; $\sigma^2(\hat\alpha)$, minimum asymptotic variance of $\hat\alpha$.

[Table body: rows $A$ = 100 to 100,000; columns $\alpha$ = 0.001 to 0.20; each cell lists $\underline{\beta}$, $\beta^0$, $\bar{\beta}$ and $\sigma^2(\hat\alpha)$.]













Should a more complicated design be used, the calculation of the maximum-likelihood 
estimates would be much more difficult and involved. (It must be pointed out that if there 
are serious doubts about the applicability of the exponential model, the biologist cannot 
rely on a design giving each test animal the same nominal dose.) 
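
The computational method in (4) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, and the counts (25 of 30 animals not reacting at a nominal dose of 0.92 organisms) are invented for the example, not data from the paper.

```python
import math

def estimate_alpha(n_no_response, n_animals, nominal_dose):
    """Estimate alpha as -(1/D) * log_e(p_hat) when every animal gets nominal dose D.

    p_hat is the observed proportion of animals that did not react. The estimate
    is undefined when p_hat is 0 or 1, so those cases raise an error.
    """
    p_hat = n_no_response / n_animals
    if not 0.0 < p_hat < 1.0:
        raise ValueError("the estimate needs some, but not all, animals to react")
    return -math.log(p_hat) / nominal_dose

# Hypothetical outcome, for illustration only.
alpha_hat = estimate_alpha(25, 30, 0.92)

# The same estimate written with common logarithms.
alpha_hat_log10 = -math.log(10.0) * math.log10(25 / 30) / 0.92
print(alpha_hat, alpha_hat_log10)
```

Both forms give the same value, close to 0.2 for these illustrative counts.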


The computations and figures in this paper were prepared under the supervision of 
Aloise Askin and Gladys Garabedian. 


MATHEMATICAL APPENDIX 
Allocation of doses 


Denote by $\beta$ the fraction of the bacterial material reserved for the dosage part of the
experiment. Let $f_i$ be the proportion administered to the $i$th animal. Then

$$\beta = \sum_{i=1}^{s} f_i$$

and the nominal dose received by the $i$th animal is $f_i A$.

Since our design is to be locally optimal, we shall define the optimal design as that one
which provides the maximum information relative to $\alpha$ (the information matrix defined
by Fisher is given by

$$X = \left(-E\,\frac{\partial^2 \log f}{\partial\theta_i\,\partial\theta_j}\right) = \left(E\,\frac{\partial \log f}{\partial\theta_i}\,\frac{\partial \log f}{\partial\theta_j}\right),$$
where $f$ is the probability density and $\theta = (\alpha, A)$ is the vector parameter to be estimated).
Since the random variable $Z$ (the number of bacterial colonies developing in a culture
medium) observed during the plate count has a Poisson law of probability with expectation
$(1-\beta)A$, it follows that

$$X = \begin{pmatrix} 0 & 0 \\ 0 & \dfrac{1-\beta}{A} \end{pmatrix} + \begin{pmatrix} A^2 & \alpha A \\ \alpha A & \alpha^2 \end{pmatrix} \sum_{i=1}^{s} \frac{f_i^2\, p_i}{1-p_i}, \qquad (6)$$

where $p_i = e^{-\alpha f_i A}$.
For the moment let us assume the following properties of the function
$$g(u) = \frac{u^2 e^{-u}}{1-e^{-u}} \qquad (7)$$
for positive $u$.


Lemma 1. (a) The equation $g'(u) = 0$ has a unique positive root at $u \approx 1.6$ (the corresponding
value of $e^{-u}$ is about 0.2, i.e. a response probability of about 0.8), and (b) $g(u)$ is concave for $0 < u < 3.08$.

It follows immediately from (a) and (1) that for an optimal design all $f_i \alpha A$ should be less
than 1.6. From (b) it follows that if $0 < u_i < 3$ for all $i$

$$g\left(\frac{1}{s}\sum_{i=1}^{s} u_i\right) \ge \frac{1}{s}\sum_{i=1}^{s} g(u_i) \qquad (8)$$
and therefore we have:


THEOREM 1. For the model considered here, all the animals should be given the same dose.
It remains only to prove the lemma and to determine the optimal $\beta$.
Proof of Lemma. Part (a):

$$g'(u) = \frac{2u e^{-u}}{1-e^{-u}} - \frac{u^2 e^{-u}}{(1-e^{-u})^2}. \qquad (9)$$





It suffices to show that the equation
$$(1-e^{-u}) - \tfrac12 u = 0 \qquad (10)$$
has a unique positive root. But this follows from the fact that the expression on the left is
concave and vanishes at $u = 0$.
Part (b):
$$g''(u) = (e^{u} - 1)^{-3}\,[e^{2u}(u^2 - 4u + 2) + e^{u}(u^2 + 4u - 4) + 2].$$


Expanding $e^{2u}$ and $e^{u}$ in a Taylor series we get


with $a_j = j(j-9)2^{j-2} + 4(2^{j-2} - 1) + j^2$, so that $a_j < 0$ if $j = 2, 3, \ldots, 8$ and $a_j > 0$ if $j = 9, 10, \ldots$.
From this it follows that $g''(u)$ has only one positive root (which is computed to be approximately
3.086) and (b) follows.
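
The two numerical constants in Lemma 1 can be reproduced with a short root search. The sketch below is an illustration, not the authors' computation: it assumes $g(u) = u^2 e^{-u}/(1-e^{-u})$, uses the bracketed expression in $g''(u)$ as written above, and the helper names and the bisection brackets are choices made here.

```python
import math

def g_prime(u):
    """Derivative of g(u) = u^2 / (e^u - 1), an equivalent form of u^2 e^-u / (1 - e^-u)."""
    em1 = math.expm1(u)
    return 2.0 * u / em1 - u * u * math.exp(u) / em1 ** 2

def g_second_numerator(u):
    """Bracketed factor of g''(u); its sign is the sign of g''(u) for u > 0."""
    return (math.exp(2 * u) * (u * u - 4 * u + 2)
            + math.exp(u) * (u * u + 4 * u - 4) + 2.0)

def bisect(f, lo, hi, tol=1e-12):
    """Plain bisection; assumes f changes sign exactly once on [lo, hi]."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

u_max = bisect(g_prime, 1.0, 2.0)                 # maximiser of g, part (a)
u_inflect = bisect(g_second_numerator, 2.0, 4.0)  # end of the concave range, part (b)
print(u_max, math.exp(-u_max), u_inflect)
```

The search returns a maximiser near 1.59, with $e^{-u}$ near 0.20, and a change of sign in $g''$ near 3.086.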

Suppose now that each animal is given the nominal dose $\beta A/s$. The maximum-likelihood
estimates of $A$ and $\alpha$ are

$$\hat A = \frac{Z}{1-\beta} \qquad (11)$$
and
$$\hat\alpha = -\frac{s}{\beta \hat A}\log_e \hat p = -\frac{s(1-\beta)}{\beta Z}\log_e \hat p, \qquad (12)$$

where $Z$ is the number of colonies developed during the plate count and $\hat p$ is the proportion
of the animals which did not respond.
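Studying $\hat\alpha$ directly or applying (6) it follows that the asymptotic variance of $\hat\alpha$ is

$$\sigma^2(\hat\alpha) = \frac{\alpha^2}{(1-\beta)A} + \frac{s\,(1-p)}{\beta^2 A^2\, p}, \qquad p = e^{-\beta A \alpha/s}. \qquad (13)$$
The optimal value of $\beta$ can be tabulated as a function of $\alpha$ and $A$. Hence designs of this
type can only be hoped to have local optimal properties. However, in many situations
some a priori knowledge of $A$, $\alpha$ is available which should enable one to get a fairly efficient
design.
Examining $\sigma^2(\hat\alpha)$ as a function of $\beta$ we see that for small values of $\alpha A/s$

$$\sigma^2(\hat\alpha) \approx \frac{\alpha^2}{(1-\beta)A} + \frac{s}{\beta^2 A^2}\left[\frac{\beta A \alpha}{s}\right] = \frac{\alpha}{A}\left[\frac{1}{\beta} + \frac{\alpha}{1-\beta}\right],$$
which is a minimum for $\beta = 1/(1+\sqrt{\alpha})$.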
For large values of $A$ there should be enough material to give each animal a dose of approximately
$ED_{80}$, which is the optimal single dose as can be determined again from the
information function. If this is the case, setting

$$\beta A \alpha/s = 1.6$$

will yield an approximately optimal value for $\beta$ when $A$ is large. For an overall approximation
to optimal $\beta$ we propose setting

$$\beta = \beta^* = \min\left[\frac{1}{1+\sqrt{\alpha}},\ \frac{1.6\,s}{\alpha A}\right].$$
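
As a quick illustration of the proposed approximation, the sketch below evaluates $\beta^*$ for the biologist's example from § 1 ($\alpha \approx 0.2$, $A \approx 40$, $s = 30$, 10 ml. of material). The function name and the conversion to millilitres are assumptions made for this example.

```python
import math

def beta_star(alpha, A, s):
    """Approximate optimal fraction of the material devoted to doses.

    Takes the smaller of 1/(1 + sqrt(alpha)), the small-A solution, and
    1.6*s/(alpha*A), the fraction that gives each animal a nominal dose 1.6/alpha.
    """
    return min(1.0 / (1.0 + math.sqrt(alpha)), 1.6 * s / (alpha * A))

# Biologist's example: about 4 organisms per ml. in 10 ml., thirty animals.
b = beta_star(alpha=0.2, A=40.0, s=30)
dose_ml = 10.0 * b           # material shared equally among the animals
plate_ml = 10.0 - dose_ml    # material used in the plate count
print(round(b, 2), round(dose_ml, 1), round(plate_ml, 1))
```

For these inputs the first term of the minimum is the active one, giving about 6.9 ml. for doses and 3.1 ml. for the plate count; for large $A$ the second term takes over.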




Table 1 gives, for various values of $\alpha$, $A$, the actual value of $\beta$ minimizing $\sigma^2(\hat\alpha)$ for $s = 30$,
the upper and lower limits of the values of $\beta$ giving at least 80% efficiency, and the minimum
value of $\sigma^2(\hat\alpha)$. If Table 1 is to be used for values of $s$ other than $s = 30$ the conversion formula

$$\sigma^2(\hat\alpha \mid s, A, \alpha, \beta) = \gamma\,\sigma^2(\hat\alpha \mid \gamma s, \gamma A, \alpha, \beta) \qquad (14)$$

is useful, where $\sigma^2(\hat\alpha \mid s, A, \alpha, \beta)$ is the asymptotic variance of $\hat\alpha$ when $s$, $A$, $\alpha$, $\beta$ are given.
For example, if $s = 60$, $A = 5000$, $\alpha = 0.04$ and the minimum asymptotic variance and
optimal $\beta$ are desired, we see from Table 1 that for $s = 30$, $A = 2500$, $\alpha = 0.04$, the optimal
$\beta$ is $\beta = 0.47$. From equation (14), with $\gamma = \tfrac12$, we find that

$$\sigma^2(\hat\alpha \mid s = 60, A = 5000, \alpha = 0.04, \beta = 0.47) = \tfrac12\,\sigma^2(\hat\alpha \mid s = 30, A = 2500, \alpha = 0.04, \beta = 0.47) = \tfrac12 \times 10^{-4}(0.835757).$$

The optimal value of $\beta$ is 0.47 and the upper and lower 80% efficiency limits on $\beta$ are 0.77
and 0.26, respectively.
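
The conversion formula can be checked directly from the asymptotic variance. The sketch below assumes the variance takes the form $\alpha^2/\{(1-\beta)A\} + s(1-p)/(\beta^2A^2p)$ with $p = e^{-\beta A\alpha/s}$, which is a reading of the appendix rather than a quoted formula; the function name is hypothetical.

```python
import math

def asymptotic_variance(s, A, alpha, beta):
    """Asymptotic variance of the estimate of alpha (assumed form, see lead-in).

    The first term comes from the plate count on the fraction 1 - beta, the second
    from the s animals sharing the fraction beta in equal nominal doses.
    """
    p = math.exp(-beta * A * alpha / s)  # probability that an animal does not respond
    return alpha ** 2 / ((1.0 - beta) * A) + s * (1.0 - p) / (beta ** 2 * A ** 2 * p)

v30 = asymptotic_variance(s=30, A=2500, alpha=0.04, beta=0.47)
v60 = asymptotic_variance(s=60, A=5000, alpha=0.04, beta=0.47)
print(v30, v60, v60 / v30)
```

Under this assumed form the $s = 30$, $A = 2500$ value agrees with $10^{-4}(0.835757)$, and doubling both $s$ and $A$ halves the variance, as the conversion formula with $\gamma = \tfrac12$ requires.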

If $\beta = \beta^*$ is used as an approximation to the optimal value of $\beta$, we see that a design of at
least 80% efficiency for all $\alpha$, $A$ and an extremely efficient design for small $A$ and large $A$
result, as can be seen in two instances in Figs. 1 and 2.


REFERENCES 


CLIFTON, C. E. (1950). Introduction to the Bacteria. New York: McGraw Hill Book Co.

COCHRAN, W. G. (1950). Estimation of bacterial densities by means of the ‘most probable number’.
Biometrics, 6, 105.

DRUETT, H. A. (1952). Bacterial invasion. Nature, Lond., 170, 288.

GOLDBERG, L. J. & WATKINS, H. M. S. (1952). A mathematical expression regarding mortality in the
albino mouse to respiratory exposure with pathogenic micro-organisms. Bact. Proc. Amer. Soc.
Bact. p. 74.

HALEY, K. D. C. (1953). Estimation of the dosage mortality relationship when the dose is subject to
error. Unpublished Ph.D. dissertation, Stanford University Library.

PETO, S. (1953). A dose-response equation for the invasion of micro-organisms. Biometrics, 9, 320.

WORCESTER, J. (1954). How many organisms? Biometrics, 10, 227.













AN EXACT TEST FOR CORRELATION BETWEEN TIME SERIES 


By E. J. HANNAN 
Australian National University, Canberra, A.C.T. 


1. INTRODUCTION 


In a previous paper (Hannan, 1955), in the process of obtaining an exact test for the serial
correlation in the residuals from a regression, an estimator of the regression coefficient was
also derived whose distributional properties, when the residual is a simple Gaussian Markoff
process, are the same as those of the estimates of the regression coefficient in the classic
cases. It was then shown that when the regressor is also generated by a simple Markoff
process the regression coefficient will, for certain values of the serial correlation of the
residual and the regressor, be a more efficient estimator than that obtained by
straightforward least squares. It follows that in these cases the test of significance of the regression
coefficient provides a test of the correlation between the two series which is asymptotically
more powerful than the approximate test based on the sample correlation coefficient.

Alternative approximate tests for correlation between two series, in the form of partial 
correlations with the effects of lagged values of one or both variates removed, were pro- 
posed by Quenouille (1949). In the next section of this paper the efficiencies of Quenouille’s 
tests are compared with those of the two above-mentioned tests when the residuals from the 
regression come from a simple Gaussian Markoff process. 

In later sections the efficiencies of these tests are considered in other cases, some of which, 
from the point of view of a test for correlation, appear more interesting. 


2. THE EXACT TEST WHEN THE RESIDUALS FROM THE REGRESSION COME FROM A SIMPLE 


MARKOFF PROCESS INDEPENDENT OF THE REGRESSOR PROCESS 


We will consider the regression
$$y_t = \alpha + \beta x_t + \epsilon_t \quad (t = 1, \ldots, n), \qquad (1)$$

where (1) $\epsilon_t = \rho_1 \epsilon_{t-1} + \eta_t$ and $\eta_t$ is $N\{0, \sigma_1^2(1-\rho_1^2)\}$, (2) $\epsilon_t$ is independent of $x_s$ for all $t$ and $s$.
In a previous paper (Hannan, 1955) it was shown that an estimator of $\beta$ could be obtained
by considering the regression

$$y_{2t} = \alpha\left\{1 - \frac{2\rho_1}{1+\rho_1^2}\right\} + \frac{\rho_1}{1+\rho_1^2}\,(y_{2t-1} + y_{2t+1}) + \beta x_{2t} - \frac{\beta\rho_1}{1+\rho_1^2}\,(x_{2t-1} + x_{2t+1}) + \zeta_t \quad (t = 1, \ldots, [\tfrac12(n-1)]).$$

Here $\zeta_t$ comes from a process of independent random variates with zero mean and
variance $\sigma_1^2(1-\rho_1^2)/(1+\rho_1^2)$ and $[\tfrac12(n-1)]$ is the greatest integer less than or equal to $\tfrac12(n-1)$.
The estimator, $b_1$, of $\beta$ obtained from the sample regression coefficient of $x_{2t}$ will have the
usual properties of least-squares estimates in the normal case so that the corresponding



partial correlation between $y_{2t}$ and $x_{2t}$ will provide an exact test of the hypothesis that the
two series are uncorrelated.

When $x_t$ also comes from a simple Markoff process, with parameter $\rho_2$ ($|\rho_2| < 1$), the
variance of the estimator, $b_1$, was shown to be asymptotically

$$\frac{2\sigma_1^2(1-\rho_1^2)(1+\rho_2^2)}{n\sigma_2^2(1+\rho_1^2)(1-\rho_2^2)}.$$

Here $\sigma_2^2$ is the variance of $x_t$.
It is well known (Wold, 1953, p. 211) that the variance of the straightforward least-squares
estimate, $b_2$, of $\beta$ in the regression (1) is, asymptotically,

$$\frac{\sigma_1^2(1+\rho_1\rho_2)}{n\sigma_2^2(1-\rho_1\rho_2)}.$$


Alternative tests for correlation between two serially correlated series were proposed by
Quenouille (1949). In the case of correlation between two Markoff processes he
recommended the use of

$$r(x_2y_2 \mid x_1), \quad r(x_2y_2 \mid y_1) \quad\text{or}\quad r(x_2y_2 \mid x_1y_1),$$

where $r(x_2y_2 \mid x_1)$, for example, is the partial correlation between $x_t$ and $y_t$ with the effects
of $x_{t-1}$ removed. Asymptotically such statistics will have variances $n^{-1}$, when there is no
correlation between the two series, since at least one of the series of residuals which are
being correlated will approach independence. Quenouille showed, by a sampling
experiment, that when the two series are uncorrelated the variances are approximately $n^{-1}$,
even for small samples, and the bias is small. The best statistic, on these grounds, appeared
to be $r(x_2y_2 \mid x_1y_1)$. However, the efficiency of the statistics, as test statistics, depends not
only on their variance on the null hypothesis but also on their distribution when the null
hypothesis is not true. In particular, their expected values are relevant. A criterion of the
asymptotic efficiency of a test based on a statistic $t$, of an hypothesis specified by a
parameter value $\theta_0$, is provided by the limit as $n \to \infty$ of the quantity (Stuart, 1954; Mood, 1954)

$$\frac{\{\partial E(t)/\partial\theta\}^2_{\theta=\theta_0}}{\operatorname{var}(t)_{\theta=\theta_0}}. \qquad (2)$$

The limit of the ratio of these quantities for two statistics $t_1$ and $t_2$ has been called (Pitman,
1948) the asymptotic relative efficiency of the tests and may be denoted by $E(t_1, t_2 \mid \theta_0)$
(where the expression (2) for $t_1$ is in the numerator).

As mentioned, Quenouille was considering the correlation between two Markoff processes
when he proposed his statistics. In the present case $y_t$ will not be Markovian if $\rho_1 \ne \rho_2$ and
$\beta \ne 0$. It is, however, of interest to examine Quenouille’s statistic in the present case also.
(It is unlikely that, in practice, partial correlations of higher order than $r(x_2y_2 \mid x_1y_1)$ will
be used.)

Now
$$\lim_{n\to\infty} \frac{\partial}{\partial\beta} E\{r(x_2y_2 \mid x_1)\} = \lim_{n\to\infty} \operatorname{cov}\{r(x_2y_2 \mid x_1),\ y'\Gamma_1^{-1}x\},$$
when $\beta$ is zero (so that the $y_t$ and $x_t$ processes are independent). Here $y$ and $x$ are vectors of
the $n$ elements $y_t$ and $x_t$ and $\Gamma_1$ is the covariance matrix of the residuals $\epsilon_t$. The differentiation
under the integral sign is justified by the uniform convergence of the resulting integral.













The limit of the covariance can be evaluated by straightforward methods (making use of 
an obvious extension of the theorem quoted in Cramer, 1946, p. 353) and gives 


1 —p3)t 
jim a {6 (r x 2x } ‘i 72(1— p32)" ‘ 
op ( 2Ye| 1) p= O71 
habe: 7 (1—p3\* 
Similarly bzce 6 (r(242 | m0) oe o1 (4) , 


Since these two statistics have the same limiting variance on the nuil hypothesis 
1(%_Yq| X,Y) is always asymptotically the more efficient. 
We now have 2 
2(1 — p42) (1— pj) (1 =f 








1 1—p? 
Ef{r(x2Yz | X,Y3), b2| 2=0} = EAONe a 


1+ 
Ef{r(%2Yz | X,Y), 1 | f=0} = 2(5 +i). 





The first of these ratios was shown for certain values of $\rho_1$ and $\rho_2$ in Table 3 in Hannan (1955).* The second ratio is shown for the same values of $\rho_1$ and $\rho_2$ in Table 1.


Table 1. The second of the above ratios ($\beta = 0$)

        -0.8   -0.6   -0.4   -0.2    0     0.2    0.4    0.6    0.8

 0      2.78   1.56   1.19   1.04   1      1.04   1.19   1.56   2.78
 0.2    1.93   1.18   0.97   0.92   0.96   1.08   1.34   1.91   3.68
 0.4    1.20   0.80   0.72   0.75   0.84   1.03   1.38   2.14   4.53
 0.6    0.62   0.47   0.47   0.52   0.64   0.85   1.24   2.13   5.06
 0.8    0.22   0.20   0.22   0.27   0.36   0.52   0.83   1.60   4.56





It is evident that, while 5, will be asymptotically more powerful than 6, when p, is suffi- 
ciently high it is always asymptotically less efficient than r(x, 4, | x,y,). It has the advantage, 
however, of giving an exact test. 

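It is easily shown that when $x_t$ is Markovian $-\mathscr{E}\left(\dfrac{\partial^2 \log L}{\partial\beta^2}\right)_{\beta=0}$ tends to

$$\{n\sigma_2^2(1+\rho_1^2-2\rho_1\rho_2)\}/\{\sigma_1^2(1-\rho_1^2)\}$$

as $n$ increases; while the quantities $-\mathscr{E}\left(\dfrac{1}{n}\dfrac{\partial^2 \log L}{\partial\beta\,\partial\sigma_1}\right)_{\beta=0}$ and $-\mathscr{E}\left(\dfrac{1}{n}\dfrac{\partial^2 \log L}{\partial\beta\,\partial\rho_1}\right)_{\beta=0}$ tend to zero. Here $L$ is the likelihood function. It follows (Whittle, 1953) that the variance of the maximum-likelihood estimator of $\beta$ is, asymptotically, when $\beta = 0$

$$\frac{\sigma_1^2(1-\rho_1^2)}{n\sigma_2^2(1+\rho_1^2-2\rho_1\rho_2)}.$$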
None of the statistics considered in this section are, therefore, asymptotically fully efficient in general, though $r(x_2y_2 \mid x_1y_1)$ is asymptotically fully efficient when $\rho_1 = \rho_2$.





* There p and p, were used in place of p, and p, respectively. 










E. J. HANNAN 319 


One can, of course, suggest tests which are asymptotically fully efficient and which do not 
require the solution of non-linear systems of equations as the maximum-likelihood estimator 
does. Such a test will be obtained by using a consistent estimator of the serial correlation 
of the residuals to transform the regression equation to a form with an approximately 
random residual (Cochrane & Orcutt, 1949). Apart from the fact that the criterion of 
asymptotic relative efficiency is only suggestive of the relative power of tests in small 
samples it has an added deficiency when applied to tests which are not exact, for it does not 
take into account the effect of the deviation of the significance point used from the true 
value appropriate to the level of significance required. In the case of Quenouille’s tests his 
sampling experiments seem to show that on the null hypothesis the mean and variance of 
his statistics in small samples are near to their theoretical values so that the significance 
point used will be approximately correct. 

Example 1. Two series of seventy observations were chosen from series 7 in Kendall (1949), the last term in one series being sufficiently far from the first in the next to make the two series effectively independent. If the two series are represented by $z_{1t}$ and $z_{2t}$, a third series which was formed can be represented by

$$y_t = 0.8\,z_{1t} + z_{2t}.$$

The commencing term in the series $z_{1t}$ is no. 191 and in $z_{2t}$ no. 316.














Table 2 
| | | 
Statistic | Sample value | Population value | Test statistic 
ee 15. Se | 
| | | 
b, 0-994 | 0-8 ty = 5:27** 
bs 0-417 0-8 To = 0-316 
1(%2Yo| 2) | 0-316 | 0-329 Ye, = 0-361** 
(29 Yo|X1Y3) | 0-684 | 0-625 Teg = 0°684** 
| 








** Highly significant. 


The three series now are Gaussian Markoff processes with first serial correlation equalling 0.9.

The details of the various tests are given in Table 2.

Here $t_{30}$ indicates Student's $t$ with 30 degrees of freedom, while $r_m$ indicates a correlation coefficient with $m$ degrees of freedom.

The observed first serial correlation coefficients were 


11,2 = 0°755, 11, = 0°739. 
The degrees of freedom appropriate to $r_{20}$ were calculated from the formula

$$n' = 70\,\frac{1-(0.755)(0.739)}{1+(0.755)(0.739)} \approx 20.$$

The statistic $r_{20}$ is not significant at the 5 % point.
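
The arithmetic of this adjustment can be checked with a few lines of code. The following is a minimal sketch, assuming the adjustment has the form $n' = n(1 - r_1 r_2)/(1 + r_1 r_2)$ with $n = 70$ and the two observed first serial correlations quoted above; the function name is illustrative and does not come from the paper.

```python
def effective_degrees_of_freedom(n, r1, r2):
    """Reduce n observations to n' = n (1 - r1*r2) / (1 + r1*r2).

    r1 and r2 are the first serial correlations of the two series.
    """
    product = r1 * r2
    return n * (1.0 - product) / (1.0 + product)


# Values quoted in Example 1 (n = 70, serial correlations 0.755 and 0.739).
n_prime = effective_degrees_of_freedom(70, 0.755, 0.739)
print(round(n_prime, 2))  # close to 20
```

With both serial correlations set to zero the function returns $n$ unchanged, as one would expect for serially independent series.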













3. AN EXACT TEST FOR CORRELATION BETWEEN TWO 
SIMPLE MARKOFF PROCESSES 


Let

$$(y_t-\mu_1) = \rho_1(y_{t-1}-\mu_1)+\epsilon_t,$$
$$(x_t-\mu_2) = \rho_2(x_{t-1}-\mu_2)+\eta_t$$

be two simple Markoff processes with normally distributed disturbances having zero means and variances $\sigma_1^2(1-\rho_1^2)$ and $\sigma_2^2(1-\rho_2^2)$.

The correlation between these processes can be prescribed in terms of the correlation 
properties of the residuals. Since these two residuals come from processes of independent 
random variates this will be achieved by saying for what lag, if any, the two residuals are 
correlated and what these correlations are. 

We will consider the case where the only non-zero correlation is between $\epsilon_t$ and $\eta_{t-m}$, and the series have been so lagged relative to each other that $m$ is zero.

If the correlation between $\epsilon_t$ and $\eta_t$ is $\rho$, then the correlation between $x_s$ and $y_t$ is

$$t \ge s:\quad \frac{\rho\{(1-\rho_1^2)(1-\rho_2^2)\}^{1/2}}{(1-\rho_1\rho_2)}\,\rho_1^{t-s} = \alpha\rho_1^{t-s},$$

$$t < s:\quad \alpha\rho_2^{s-t}.$$
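
The cross-correlation just described is easy to tabulate. The sketch below assumes the reading $\alpha = \rho\{(1-\rho_1^2)(1-\rho_2^2)\}^{1/2}/(1-\rho_1\rho_2)$, with the correlation between $x_s$ and $y_t$ equal to $\alpha\rho_1^{t-s}$ for $t \ge s$ and $\alpha\rho_2^{s-t}$ for $t < s$; the function names are illustrative only.

```python
import math


def alpha(rho, rho1, rho2):
    """Contemporaneous correlation between x_t and y_t.

    rho is the correlation between the two disturbances; rho1 and rho2
    are the Markoff parameters of the y and x processes respectively.
    """
    return rho * math.sqrt((1 - rho1**2) * (1 - rho2**2)) / (1 - rho1 * rho2)


def cross_correlation(rho, rho1, rho2, t, s):
    """Correlation between x_s and y_t under the model of this section."""
    a = alpha(rho, rho1, rho2)
    if t >= s:
        return a * rho1 ** (t - s)
    return a * rho2 ** (s - t)


# When rho1 == rho2 the factor multiplying rho is unity, so alpha == rho.
print(alpha(0.5, 0.6, 0.6))
print(cross_correlation(0.5, 0.6, 0.3, t=3, s=1))
print(cross_correlation(0.5, 0.6, 0.3, t=1, s=3))
```

The last two lines illustrate the asymmetry in the lag when $\rho_1 \neq \rho_2$.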


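The two series $y_t$ and $x_t$ will be jointly normally distributed for every $t$ so that

$$(y_t-\mu_1) = \alpha\frac{\sigma_1}{\sigma_2}(x_t-\mu_2)+\zeta_t,$$

where $\zeta_t$ is independent of $x_t$ and has mean zero and variance $\sigma_1^2(1-\alpha^2)$.

Then a simple calculation shows that $\mathscr{E}(\zeta_t\zeta_{t-1}) = \sigma_1^2(1-\alpha^2)\rho_1$. It follows, since $\zeta_t$ is normally distributed, that it is generated by a simple Gaussian Markoff process with parameter $\rho_1$ (Doob, 1953, p. 233). The disturbance $\theta_t$ in the process $\zeta_t = \rho_1\zeta_{t-1}+\theta_t$ is

$$\theta_t = \alpha\frac{\sigma_1}{\sigma_2}(\rho_1-\rho_2)(x_{t-1}-\mu_2)+\epsilon_t-\alpha\frac{\sigma_1}{\sigma_2}\eta_t.$$

If $\rho_1 \neq \rho_2$ and $\rho \neq 0$, $\theta_t$ depends upon $x_{t-1}$ and $\zeta_t$ is not independent of $x_s$ for $t \neq s$. The $\theta_t$ are of course independent.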

When $\rho_1 = \rho_2$ the present case is equivalent to that considered in the previous section. When $\rho_1 \neq \rho_2$ the exact test of significance of the correlation between the two series there given is still an exact test, but the conclusions as to the power of this test and the others there considered, when $x_t$ also comes from a Markoff process, do not follow, for when $\rho \neq 0$ the condition (2) upon which the derivation depended does not hold.

The conditional distribution of the $y_{2t}$ $(t = 1, \ldots, [\tfrac{1}{2}(n-1)])$, for fixed $x_t$ $(t = 1, \ldots, n)$ and $y_{2t+1}$ $(t = 0, \ldots, [\tfrac{1}{2}(n-1)])$ is the product of the conditional distributions of each $y_{2t}$, for fixed $y_{2t-1}$, $y_{2t+1}$, $x_{2t-1}$, $x_{2t}$ and $x_{2t+1}$. This follows from the fact that for fixed $y_{2t+1}$ and $x_{2t+1}$ the $y_{2t}$ and $x_{2t}$ are serially independent (Ogawara, 1951). This conditional distribution can be found







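from the correlation matrix of the variates $y_{2t-1}$, $y_{2t+1}$, $x_{2t-1}$, $x_{2t}$, $x_{2t+1}$, $y_{2t}$ (taken in that order), which is

$$\begin{bmatrix} A & B \\ B' & 1 \end{bmatrix},$$

where

$$A = \begin{bmatrix}
1 & \rho_1^2 & \alpha & \alpha\rho_2 & \alpha\rho_2^2 \\
\rho_1^2 & 1 & \alpha\rho_1^2 & \alpha\rho_1 & \alpha \\
\alpha & \alpha\rho_1^2 & 1 & \rho_2 & \rho_2^2 \\
\alpha\rho_2 & \alpha\rho_1 & \rho_2 & 1 & \rho_2 \\
\alpha\rho_2^2 & \alpha & \rho_2^2 & \rho_2 & 1
\end{bmatrix},$$

$$B' = \{\rho_1,\ \rho_1,\ \alpha\rho_1,\ \alpha,\ \alpha\rho_2\}.$$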


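Then the variate $y_{2t}$ is conditionally distributed with mean $\sigma_1(A^{-1}B)'Z$ and variance $\sigma_1^2\{1 - B'A^{-1}B\}$ (Rao, 1952, p. 54). Here

$$Z' = \left(\frac{y_{2t-1}}{\sigma_1},\ \frac{y_{2t+1}}{\sigma_1},\ \frac{x_{2t-1}}{\sigma_2},\ \frac{x_{2t}}{\sigma_2},\ \frac{x_{2t+1}}{\sigma_2}\right).$$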

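The vector $(A^{-1}B)'$ is easily seen to be

$$\frac{1}{1+\rho_1^2}\left\{\rho_1,\ \rho_1,\ -\frac{\rho_2\alpha(1-\rho_1\rho_2)}{1-\rho_2^2},\ \frac{\alpha(1-\rho_1^2\rho_2^2)}{1-\rho_2^2},\ -\frac{\rho_1\alpha(1-\rho_1\rho_2)}{1-\rho_2^2}\right\},$$

while

$$\sigma_1^2\{1-B'A^{-1}B\} = \sigma_1^2\,\frac{(1-\rho_1^2)(1-\rho^2)}{1+\rho_1^2}.$$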


For $\rho_1 = \rho_2$, these formulae are equivalent to those given in the previous section, when it is remembered that there the variance of the residual $\epsilon_t$ was $\sigma^2$, while here the variance of the corresponding residual is $\sigma_1^2(1-\rho_1^2)$.
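
The closed forms for the regression vector and the conditional variance can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the paper: it builds the correlation matrix $A$ and vector $B$ for the variates $y_{2t-1}, y_{2t+1}, x_{2t-1}, x_{2t}, x_{2t+1}$ from the cross-correlation $\alpha\rho_1^{t-s}$ ($t \ge s$), $\alpha\rho_2^{s-t}$ ($t < s$), solves $Aw = B$ by Gaussian elimination, and compares the result with the assumed closed forms $w = (1+\rho_1^2)^{-1}\{\rho_1, \rho_1, -\rho_2 c, (1+\rho_1\rho_2)c, -\rho_1 c\}$, $c = \alpha(1-\rho_1\rho_2)/(1-\rho_2^2)$, and $1 - B'A^{-1}B = (1-\rho_1^2)(1-\rho^2)/(1+\rho_1^2)$. All function names are hypothetical.

```python
import math


def solve(matrix, rhs):
    """Solve matrix * w = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (aug[i][n] - tail) / aug[i][i]
    return w


def regression_check(rho, rho1, rho2):
    """Return (numerical weights, closed-form weights, numerical and closed variance)."""
    a = rho * math.sqrt((1 - rho1**2) * (1 - rho2**2)) / (1 - rho1 * rho2)
    A = [
        [1.0, rho1**2, a, a * rho2, a * rho2**2],
        [rho1**2, 1.0, a * rho1**2, a * rho1, a],
        [a, a * rho1**2, 1.0, rho2, rho2**2],
        [a * rho2, a * rho1, rho2, 1.0, rho2],
        [a * rho2**2, a, rho2**2, rho2, 1.0],
    ]
    B = [rho1, rho1, a * rho1, a, a * rho2]
    weights = solve(A, B)
    scale = 1.0 / (1.0 + rho1**2)
    c = a * (1 - rho1 * rho2) / (1 - rho2**2)
    closed = [scale * rho1, scale * rho1, -scale * rho2 * c,
              scale * c * (1 + rho1 * rho2), -scale * rho1 * c]
    variance = 1.0 - sum(b * w for b, w in zip(B, weights))
    closed_variance = (1 - rho1**2) * (1 - rho**2) / (1 + rho1**2)
    return weights, closed, variance, closed_variance


weights, closed, variance, closed_variance = regression_check(0.5, 0.6, 0.3)
print(weights)
print(closed)
print(variance, closed_variance)
```

The third and fifth weights differ whenever $\rho_1 \neq \rho_2$, which is the loss of symmetry discussed next.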

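For $\rho_1 \neq \rho_2$ two things are noticeable:

(1) The regression coefficient of $x_{2t-1}$ is not equal to that for $x_{2t+1}$, so that some of the symmetry present when $\rho_1 = \rho_2$ is now lost. This loss of symmetry is also evident from the fact that

$$\mathscr{E}\{(x_t-\mu_2)(y_{t+s}-\mu_1)\} \neq \mathscr{E}\{(x_{t+s}-\mu_2)(y_t-\mu_1)\},$$

when $\rho_1 \neq \rho_2$ and $s \neq 0$, and

$$\mathscr{E}\{(y_t-\mu_1)(x_{t+s}-\mu_2) \mid y_{t+1}, y_{t-1}\}\ \begin{cases} = 0 & s < 0, \\ \neq 0 & s > 0. \end{cases}$$

(2) The coefficient of $x_{2t}$ is no longer equal to $\alpha\sigma_1/\sigma_2$ but instead to

$$\frac{\alpha\sigma_1}{\sigma_2}\left\{\frac{1-\rho_1^2\rho_2^2}{(1+\rho_1^2)(1-\rho_2^2)}\right\}.$$

The asymptotic variance of the estimator, $b_1$, of this last coefficient is, for $\rho = 0$,

$$\frac{2\sigma_1^2(1-\rho_1^2)(1+\rho_2^2)}{n\sigma_2^2(1+\rho_1^2)(1-\rho_2^2)},$$

so that the quantity

$$\lim_{n\to\infty}\left[n\operatorname{var}(b_1)\Big/\left\{\frac{\partial}{\partial\rho}\mathscr{E}(b_1)\right\}^2_{\rho=0}\right]$$

is

$$\frac{2(1+\rho_1^2)(1+\rho_2^2)}{(1+\rho_1\rho_2)^2}.$$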













For the ordinary correlation coefficient, $r$, between $x_t$ and $y_t$ the corresponding quantity is (Bartlett, 1935)

$$\frac{1-\rho_1^2\rho_2^2}{(1-\rho_1^2)(1-\rho_2^2)}.$$

The ratio of the second of these quantities to the first (the asymptotic relative efficiency of the two tests) is

$$E(b_1, r \mid \rho=0) = \frac{(1-\rho_1^2\rho_2^2)(1+\rho_1\rho_2)^2}{2(1-\rho_1^4)(1-\rho_2^4)},$$

which is evaluated for certain values of $\rho_1$ and $\rho_2$ in Table 3. For $E(b_1, r \mid \rho=0) > 1$ it may be inferred that, asymptotically at least, the test based on $b_1$ is superior to that based on $r$.


Table 3. $E(b_1, r \mid \rho = 0)$

                                  $\rho_2$
 $\rho_1$   -0.8   -0.6   -0.4   -0.2    0     0.2    0.4    0.6    0.8

 0          0.85   0.57   0.51   0.50   0.50   0.50   0.51   0.57   0.85
 0.2        0.58   0.44   0.43   0.46   0.50   0.54   0.60   0.71   1.11
 0.4        0.36   0.32   0.36   0.43   0.51   0.60   0.69   0.85   1.36
 0.6        0.20   0.24   0.32   0.44   0.57   0.71   0.85   1.06   1.64
 0.8        0.11   0.20   0.36   0.58   0.85   1.11   1.36   1.64   2.28
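
Entries of Table 3 can be recomputed directly. The following sketch assumes the asymptotic relative efficiency has the form $(1-\rho_1^2\rho_2^2)(1+\rho_1\rho_2)^2 / \{2(1-\rho_1^4)(1-\rho_2^4)\}$, which is the reading of the expression above that agrees with the tabulated values checked here; the function name is illustrative.

```python
def are_b1_vs_r(rho1, rho2):
    """Asymptotic relative efficiency E(b1, r | rho = 0) for two Markoff series."""
    numerator = (1 - rho1**2 * rho2**2) * (1 + rho1 * rho2) ** 2
    denominator = 2 * (1 - rho1**4) * (1 - rho2**4)
    return numerator / denominator


# Print the table layout: rows are rho1, columns are rho2.
columns = [-0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8]
for rho1 in [0.0, 0.2, 0.4, 0.6, 0.8]:
    row = " ".join(f"{are_b1_vs_r(rho1, rho2):5.2f}" for rho2 in columns)
    print(f"{rho1:3.1f} {row}")
```

The expression is symmetric in $\rho_1$ and $\rho_2$, and equals one half when both parameters are zero, reflecting the fact that the exact test uses only every second observation.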






































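As in § 2 we have

$$\lim_{n\to\infty}\left[\frac{\partial}{\partial\rho}\mathscr{E}\{r(x_2y_2 \mid x_1)\}\right]_{\rho=0} = \lim_{n\to\infty}\operatorname{cov}\{r(x_2y_2 \mid x_1),\ y'\Gamma_1^{-1}\Gamma_3\Gamma_2^{-1}x\},$$

the differentiation having been carried out under the integral sign and $y$ now being independent of $x$. Here $\Gamma_1$ is the covariance matrix of the $y_t$, $\Gamma_2$ is the covariance matrix of the $x_t$ and

$$\Gamma_3 = \frac{1}{\rho}\left[\mathscr{E}\{(y_i-\mu_1)(x_j-\mu_2)\}\right].$$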


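The limit of the covariance can be evaluated in the same manner as in § 2 and gives

$$\lim_{n\to\infty}\left[\frac{\partial}{\partial\rho}\mathscr{E}\{r(x_2y_2 \mid x_1)\}\right]_{\rho=0} = (1-\rho_1^2)^{1/2},$$

$$\lim_{n\to\infty}\left[\frac{\partial}{\partial\rho}\mathscr{E}\{r(x_2y_2 \mid y_1)\}\right]_{\rho=0} = (1-\rho_2^2)^{1/2},$$

$$\lim_{n\to\infty}\left[\frac{\partial}{\partial\rho}\mathscr{E}\{r(x_2y_2 \mid x_1y_1)\}\right]_{\rho=0} = 1.$$

Clearly once more $r(x_2y_2 \mid x_1y_1)$ provides the most efficient test statistic.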





It can be shown that, for $\rho = 0$, the quantities $-\mathscr{E}\left(\dfrac{\partial^2 \log L}{\partial\rho\,\partial\theta_j}\right)_{\rho=0}$ are asymptotically zero. Here $L$ is the likelihood function and the $\theta_j$ $(j = 1, 2, 3, 4)$ are the remaining parameters in that function. At $\rho = 0$, $-\mathscr{E}\left(\dfrac{\partial^2 \log L}{\partial\rho^2}\right) = n$. It therefore appears (Whittle, 1953) that the asymptotic variance of the maximum-likelihood estimator of $\rho$ is $n^{-1}$ when $\rho$ is zero. Since










this estimator will be consistent both $r$ and $b_1$ are asymptotically inefficient as test statistics when $\rho_1$ and $\rho_2$ are not zero, while $r(x_2y_2 \mid x_1y_1)$ is asymptotically fully efficient.

When $\rho_1$ and $\rho_2$ are high and of the same sign $b_1$ will give an exact test of the null hypothesis which is more efficient than the test based on $r$. However, the test based on $b_1$ is always much less efficient than that based on $r(x_2y_2 \mid x_1y_1)$.

The computational burden of the exact test can of course be reduced by omitting one or both of $x_{2t-1}$ and $x_{2t+1}$ from the regression and the test will still be exact. It can be shown, however, that the test will be asymptotically less efficient. Another alternative is to use $(x_{2t-1}+x_{2t+1})$ in place of $x_{2t-1}$ and $x_{2t+1}$. It can then be shown that the asymptotic efficiency of the test based on the coefficient of $x_{2t}$ is unchanged. In finite samples, however, the power of the test may be adversely affected.
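
The mechanics of the exact test can be sketched in code. What follows is an illustration under assumptions, not the author's computing procedure: each even-numbered observation $y_{2t}$ is regressed on a constant, $(y_{2t-1}+y_{2t+1})$, $x_{2t-1}$, $x_{2t}$ and $x_{2t+1}$ by least squares, and the coefficient of $x_{2t}$ is referred to Student's $t$. The inclusion of a constant term, the degrees of freedom (number of even-numbered observations used less five), the simulated data and all function names are assumptions made for the example.

```python
import math
import random


def solve(matrix, rhs):
    """Solve matrix * w = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (aug[i][n] - tail) / aug[i][i]
    return w


def exact_test(x, y):
    """Return (coefficient of x_2t, t statistic, degrees of freedom)."""
    rows, response = [], []
    # Zero-based indices 1, 3, 5, ... correspond to t = 2, 4, 6, ...
    for i in range(1, len(y) - 1, 2):
        rows.append([1.0, y[i - 1] + y[i + 1], x[i - 1], x[i], x[i + 1]])
        response.append(y[i])
    p = 5
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * v for r, v in zip(rows, response)) for a in range(p)]
    coef = solve(xtx, xty)
    rss = sum((v - sum(c * ra for c, ra in zip(coef, r))) ** 2
              for r, v in zip(rows, response))
    df = len(rows) - p
    unit = [0.0] * p
    unit[3] = 1.0
    inverse_column = solve(xtx, unit)  # column of (X'X)^{-1} for x_2t
    standard_error = math.sqrt(rss / df * inverse_column[3])
    return coef[3], coef[3] / standard_error, df


def markoff(n, rho, rng):
    """Simulate a Gaussian Markoff series with unit variance."""
    value, series = rng.gauss(0.0, 1.0), []
    for _ in range(n):
        value = rho * value + math.sqrt(1 - rho**2) * rng.gauss(0.0, 1.0)
        series.append(value)
    return series


rng = random.Random(1955)
x = markoff(2001, 0.5, rng)
residual = markoff(2001, 0.5, rng)
y = [0.8 * xi + ei for xi, ei in zip(x, residual)]
coefficient, t_statistic, df = exact_test(x, y)
print(coefficient, t_statistic, df)
```

In this simulated case the residual is a Markoff process independent of the regressor, so the coefficient of $x_{2t}$ estimates the regression coefficient 0.8 used to build the series.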

The following example illustrates the use of some of the tests mentioned in this section. 

Example 2. In Moran (1952) the correlations between the records of the numbers of certain 
game birds shot over a period of years were considered. The series for caper and ptarmigan 
are considered in this example. 

The application of the test given in Quenouille (1947a) shows that both series are well 
represented by a simple Markoff process. 

The results of the various tests discussed above are shown in Table 4. Here $r_m$ indicates a correlation coefficient with $m$ degrees of freedom.











Table 4 
E eo | | oe 
Statistic Sample value | Test statistic | 
| 
am | a 
r | 0-417 | 34 = 0-417* | 
b, | 0-444 | ts, = 2-484* 
| r(%_Y2| X14) 0-387 Tog = 0-387*** | 
* Significant at 5% point. *** Significant at 0-1 % point. 


The two observed serial correlations were 0.592 and 0.417 for caper and ptarmigan respectively, and these were adjusted up to 0.65 and 0.55 by Moran to allow for bias. Using these last values it appears from Table 3 that $r$ and $b_1$ should give tests of roughly the same efficiency while $r(x_2y_2 \mid x_1y_1)$ should be approximately twice as efficient as either.


4. THE CASE WHERE THE REGRESSOR PROCESS IS A SECOND-ORDER
AUTOREGRESSIVE PROCESS


The exact tests for correlation between two series which have been presented in this paper 
remain exact provided that the residual from the regression of one variate on the other is 
a Markoff process. The efficiency of the tests has been examined in the following cases: 

(1) The regressor process is a Markoff process wholly independent of the residual. In 
particular this covers the case where the regressand is also a Markoff process with the same 
parameter as the regressor and the residual. 

(2) The regressand and regressor processes are both Markovian. When the two para- 
meters are equal this case is equivalent to the particular case mentioned under (1) above. 

If neither of the two processes being correlated can reasonably be said to be Markovian 













then, strictly, the exact tests presented here are not applicable, for the residual can then be 
Markovian only if there is in truth a relation between the series. (One might apply the test 
as an approximation to an exact test of course.) 

A case which may arise in practice is that in which one process appears Markovian while 
the other is an autoregressive process of higher order. The more interesting alternative 
hypothesis then appears to be one in which the correlation between the series arises from a 
correlation between the shocks of which they are linearly composed. 

Let

$$(y_t-\mu_1) = \rho_1(y_{t-1}-\mu_1)+\epsilon_t$$

and

$$(x_t-\mu_2)+a(x_{t-1}-\mu_2)+b(x_{t-2}-\mu_2) = \eta_t$$

be two processes where,

(1) $|\rho_1| < 1$ and all roots of $z^2+az+b = 0$ are less than unity in absolute value,
(2) $\epsilon_t$ and $\eta_t$ are jointly normally distributed with variances

$$\sigma_1^2(1-\rho_1^2) \quad\text{and}\quad \sigma_2^2\{(1-b)((1+b)^2-a^2)(1+b)^{-1}\}$$

and correlation $\rho$.


Table 5. $E(b_1, r \mid \rho = 0)$

  b      $\rho_1=\rho_2=0.4$   $\rho_1=\rho_2=0.6$   $\rho_1=\rho_2=0.8$   $\rho_1=0.6$, $\rho_2=0.8$   $\rho_1=0.8$, $\rho_2=0.6$

 -0.6          0.82                  1.42                  3.45                  2.73                        1.82
 -0.4          0.77                  1.30                  3.04                  2.33                        1.79
 -0.2          0.73                  1.18                  2.65                  1.97                        1.73
  0            0.69                  1.06                  2.28                  1.64                        1.64
  0.2          0.65                  0.95                  1.94                  1.36                        1.53
  0.4          0.62                  0.86                  1.64                  1.12                        1.42
  0.6          0.59                  0.79                  1.40                  0.95                        1.33





Then by straightforward methods it can be shown that

$$E(b_1, r \mid \rho=0) = \frac{(1+\rho_1\rho_2-b)^2\{1+\rho_1\rho_2-b(\rho_1^2+\rho_1\rho_2)\}\{1-\rho_1\rho_2-b(\rho_1\rho_2-\rho_1^2)\}}{2(1-\rho_1^4)(1-\rho_2^2)(1-b)\{1+\rho_2^2-b(1-\rho_2^2)\}}.$$

Here $\rho_2$ is the first serial correlation of the $x_t$ process.

At $b = 0$ this is equal to the quantity tabulated above. The ratio decreases with $b$, for most values of $\rho_1$ and $\rho_2$, at least in the neighbourhood of $b = 0$. For sufficiently large values of $\rho_1$ and $\rho_2$ it will be greater than unity. The ratio is tabulated in Table 5 for certain values of $\rho_1$, $\rho_2$ and $b$.
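
The ratio for the second-order case can be evaluated in the same way as for Table 3. The sketch below assumes one particular reading of the expression for $E(b_1, r \mid \rho = 0)$, namely numerator $(1+\rho_1\rho_2-b)^2\{1+\rho_1\rho_2-b(\rho_1^2+\rho_1\rho_2)\}\{1-\rho_1\rho_2-b(\rho_1\rho_2-\rho_1^2)\}$ over denominator $2(1-\rho_1^4)(1-\rho_2^2)(1-b)\{1+\rho_2^2-b(1-\rho_2^2)\}$. This reading reduces to the first-order expression at $b = 0$ and agrees with the Table 5 entries checked in the example, but it is an assumption and has not been checked against every tabulated value. The function name is illustrative.

```python
def are_b1_vs_r_second_order(rho1, rho2, b):
    """E(b1, r | rho = 0) when x is second-order autoregressive (assumed form).

    rho1 is the Markoff parameter of y, rho2 the first serial correlation
    of x, and b the coefficient of x_{t-2} in the autoregression for x.
    """
    p = rho1 * rho2
    numerator = ((1 + p - b) ** 2
                 * (1 + p - b * (rho1**2 + p))
                 * (1 - p - b * (p - rho1**2)))
    denominator = (2 * (1 - rho1**4) * (1 - rho2**2) * (1 - b)
                   * (1 + rho2**2 - b * (1 - rho2**2)))
    return numerator / denominator


for b in [-0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6]:
    print(b, round(are_b1_vs_r_second_order(0.8, 0.8, b), 2))
```

For $\rho_1 = \rho_2 = 0.8$ the printed values fall steadily as $b$ increases, in line with the remark that the ratio decreases with $b$.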

It is clear that the statistic r(xgy, | z,2,y.) will now give a test which is asymptotically 


fully efficient. The asymptotic efficiency of the statistic r(x.y,|2,y,) can be fairly easily 
evaluated and proves to be 





EXr (224s | XY), 7(XsYs | Te Yo) | P= 0} = (1—5%). 
The use of $r(x_2y_2 \mid x_1y_1)$ will then result in little loss of information unless $b$ is fairly large.


Finally 1—b+ : 
E{b,, 1(L2Yo | %1Y;)} BS 2(1 + p3) ‘ —b) a +b)} y 










This is less than unity for most of the range of values of the parameters involved though it becomes infinite as $b$ tends to unity.

Example 3. The data for grouse and blackgame from Moran (1952) may be used for 
illustration. The blackgame series appears to be Markovian but the first partial serial 
correlation for the grouse series is 0-34. The first serial correlation for grouse is 0-64 and for 
blackgame 0-43. Using 6 = 0-3, p, = 0-7 and p, = 0-5 we obtain 


E(b,,r| p=0) = 0-02, Bf{by,r(x2y2| x,y, | p= 0} = 0-47. 
The tests are shown in Table 6. 








Table 6 
} 
| Statistic Sample value Test statistic 
by 0-014 ty, = 2-56* 
r 0-347 135 = 0-347* 
| 7(X2Y2| XY) 0-327 Teq = 0°327** 














* Significant at 5% point. ** Significant at 1% point. 


5. GENERALIZATION 


The preceding exact tests can of course be generalized to multiple correlation and analogous 
results can be expected. 

A further generalization of both the simple and multiple correlation tests can be obtained 
when one process is an autoregressive process of higher order than the first. For a process of 
order $h$ however only $n(h+1)^{-1}$ observations will be used in the correlation (where $n$ is the
number of observations available), the remainder being used to reduce the series to in- 
dependence in time. The serial correlations would have to be very high indeed before the 
test began to compare favourably with one based on the ordinary correlation (or multiple 
correlation) coefficient. The computational burden involved in the production of the partial 
correlation (or partial multiple correlation) coefficient will also be great. 


6. SUMMARY 


An exact test of the correlation between two series has been obtained which can be applied whenever one of the two series is Markovian. If the two series are $x_t$ and $y_t$ and $y_t$ is Markovian the test statistic, $b_1$, is the partial correlation between $x_{2t}$ and $y_{2t}$, when the effects of $(y_{2t-1}+y_{2t+1})$, $x_{2t-1}$ and $x_{2t+1}$ have been removed. The asymptotic efficiency of this statistic is compared with that of the ordinary correlation coefficient, $r$, between the two series and the statistic $r(x_2y_2 \mid x_1y_1)$ suggested by Quenouille (1949), under the following conditions:

(a) The residual process from the regression of $y_t$ on $x_t$ is independent of the $x_t$ process and comes from a Gaussian Markoff process.

(b) The two series $x_t$ and $y_t$ are Markovian and are correlated through a correlation between the (Gaussian) errors in the two processes.

(c) As in (b) but with $x_t$ generated by a second order autoregressive process.

While in all three cases the statistic $b_1$ leads to an asymptotically more efficient test than $r$ for high serial correlation of the residual process in the regression of $y_t$ on $x_t$, the statistic $r(x_2y_2 \mid x_1y_1)$ is always asymptotically more efficient than $b_1$ with the exception of some cases, under (c), where the first partial correlation of the $x_t$ process is high and positive.


I should like to thank a referee of this paper for pointing out the importance of 
Quenouille’s statistic. 


REFERENCES 


Bartlett, M. S. (1935). J. R. Statist. Soc. 98, 536.

Cochrane, D. & Orcutt, G. H. (1949). J. Amer. Statist. Ass. 44, 32.

Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press.

Doob, J. L. (1953). Stochastic Processes. New York: John Wiley and Sons.

Hannan, E. J. (1955). Biometrika, 42, 133.

Kendall, M. G. (1949). Biometrika, 36, 267.

Mood, A. M. (1954). Ann. Math. Statist. 25, 514.

Moran, P. A. P. (1952). J. Animal Ecol. 21, 154.

Ogawara, M. (1951). Ann. Math. Statist. 22, 115.

Pitman, E. J. G. (1948). Lecture notes on non-parametric inference. University of North Carolina.

Quenouille, M. H. (1947a). J. R. Statist. Soc. 60, 123.

Quenouille, M. H. (1947b). Biometrika, 34, 365.

Quenouille, M. H. (1949). J. R. Statist. Soc. B, 11, 68.

Rao, C. R. (1952). Advanced Statistical Methods in Biometric Research. New York: John Wiley and Sons.

Stuart, A. (1954). J. Amer. Statist. Ass. 49, 147.

Whittle, P. (1953). J. R. Statist. Soc. B, 15, 125.

Wold, H. (1953). Demand Analysis. New York: John Wiley and Sons.







SERIAL CORRELATION IN REGRESSION ANALYSIS. I†


By G. S. WATSON
Australian National University, Canberra, A.C.T. 


1. INTRODUCTION 


The procedures for linear regression analysis when the errors have a multivariate normal joint distribution with covariance matrix $\sigma^2\alpha$, where $\alpha$ is a given matrix, are well known (see, for example, Aitken, 1934-5). However, in many practical applications, the matrix $\alpha$ is not completely known. In the analysis of ordered observations, the errors are often serially correlated. Even if the errors are assumed to be generated by a stationary stochastic process of the moving-average or autoregressive type of lower order, the actual selection of the process and its parameters is a difficult matter, requiring samples of a size not often available. It is therefore certain that regression analyses will often be made using the wrong matrix $\alpha$.

Recently Hannan (1955) has given a method which, while not fully efficient, leads to exact tests when the error process is Markov with unknown $\rho$. This is an advance on previous writers (e.g. Wold, 1949; Cochrane & Orcutt, 1949), who aimed at the estimation of the serial correlations of the errors and hence the determination of $\alpha$. But the basic difficulty still remains: the error process may not be Markov. Furthermore, while the difficulties are well known, the orders of magnitude of the effects of using the wrong matrix $\alpha$ have received little precise study except for the case of unequal variances in the analysis of variance (see, for example, Box, 1954).

For these reasons in Part I of this paper an examination will be made of the performance of a regression analysis based on the assumption that the error covariance matrix is $\sigma^2\gamma$ when it is, in fact, $\sigma^2\alpha$. The methods used have their origin in the papers of Durbin & Watson (1950, 1951). When $\gamma \neq \alpha$ the regression vectors play an important role in determining the efficiency of the regression coefficient estimates and the true significance levels of significance tests. The procedure adopted is to find, for fixed $\alpha$ and $\gamma$, the regression vectors which would make the analysis as 'bad' as possible; this leads to inequalities on the bias in the estimates of variance of the regression coefficients, on the efficiency of the estimates of the regression coefficients and on the significance points of the various t- and F-tests.

In Part II, these inequalities will be applied to cases where $\sigma^2\alpha$ and $\sigma^2\gamma$ are the covariance matrices of first- and second-order autoregressive and moving-average processes. By using approximations to these matrices, it is possible to get simple algebraic and numerical results.

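We consider the linear regression of a dependent variable $y$ on $k$ (linearly independent) regression variables $x_1, x_2, \ldots, x_k$. The model for a sample of $N$ observations is, in matrix notation,

$$\begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} x_{11} & \cdots & x_{k1} \\ \vdots & & \vdots \\ x_{1N} & \cdots & x_{kN} \end{bmatrix} \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} u_1 \\ \vdots \\ u_N \end{bmatrix}, \tag{1.1}$$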

† This work was carried out while the writer was a Research Officer in the Department of Applied Economics, University of Cambridge. The results of this paper and its sequel were presented at the Conference of the Econometric Society, Louvain, Belgium in September 1951. It formed part of a thesis which was issued by the Institute of Statistics, University of North Carolina, Mimeograph Series, no. 49.


or

$$y = [x_1, \ldots, x_k]\beta + u,$$

or

$$y = X\beta + u,$$

where the error vector $u$ has zero expectation and

$$E(uu') = \sigma^2\alpha \quad (\alpha \text{ non-singular}). \tag{1.2}$$

$\sigma^2$, $\alpha$ and $\beta$ are supposed unknown. If the statistician, analysing these data, decides a priori that the error covariance matrix is $\sigma^2\gamma$, he will essentially proceed as follows. Since $\gamma$ is known it is possible to find a non-singular $N \times N$ matrix $H$ such that

$$H\gamma H' = I. \tag{1.3}$$

Writing $Hy = y^*$, $Hx_j = x_j^*$, $Hu = u^*$, $HX = X^*$, the result of applying $H$ to equation (1.1) is the transformed regression equation

$$y^* = X^*\beta + u^*, \tag{1.4}$$

where $u^*$ has zero expectation and

$$E(u^*u^{*\prime}) = \sigma^2 H\alpha H'. \tag{1.5}$$


The regression equation (1.4) will now be treated by the least-squares procedures.† Thus in our analysis of the effects of 'guessing' the error covariance matrix, we may without loss of generality proceed on the assumption that the least-squares procedure has been used on the model represented by equations (1.1) and (1.2) and merely replace $\alpha$ by $H\alpha H'$ in the final results.
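
One way to construct such a matrix $H$ can be sketched as follows. This is an illustration only, since the paper does not prescribe a construction: it assumes $\gamma$ is symmetric positive-definite, factorizes it as $\gamma = LL'$ (Cholesky), and takes $H = L^{-1}$, so that $H\gamma H' = I$. The example matrix $\gamma_{ij} = 0.5^{|i-j|}$ and all function names are hypothetical.

```python
import math


def cholesky(g):
    """Lower-triangular L with L L' = g, for symmetric positive-definite g."""
    n = len(g)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(g[i][i] - s)
            else:
                lower[i][j] = (g[i][j] - s) / lower[j][j]
    return lower


def invert_lower(lower):
    """Inverse of a lower-triangular matrix by forward substitution."""
    n = len(lower)
    inverse = [[0.0] * n for _ in range(n)]
    for col in range(n):
        for i in range(col, n):
            rhs = 1.0 if i == col else 0.0
            s = sum(lower[i][k] * inverse[k][col] for k in range(col, i))
            inverse[i][col] = (rhs - s) / lower[i][i]
    return inverse


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


# Hypothetical guessed matrix: first-order autoregressive correlations.
n = 5
gamma = [[0.5 ** abs(i - j) for j in range(n)] for i in range(n)]
H = invert_lower(cholesky(gamma))
identity_check = matmul(matmul(H, gamma), transpose(H))
print([[round(v, 10) for v in row] for row in identity_check])
```

Any orthogonal matrix applied after this $H$ gives another valid choice, so $H$ is not unique; the least-squares results on the transformed model do not depend on which one is used.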

On these assumptions, the estimate of $\beta$ used would be

$$b = (X'X)^{-1}X'y. \tag{1.6}$$

The covariance matrix of the components of $b$ is actually

$$\sigma^2(X'X)^{-1}X'\alpha X(X'X)^{-1}, \tag{1.7}$$

but it would be taken as

$$\sigma^2(X'X)^{-1}, \tag{1.8}$$

where $\sigma^2$ would be estimated by

$$s^2 = \frac{(y-Xb)'(y-Xb)}{N-k}. \tag{1.9}$$

The true expected value of $s^2$ is given by

$$\begin{aligned}
E(s^2) &= \frac{1}{N-k}E\{(u-X(X'X)^{-1}X'u)'(u-X(X'X)^{-1}X'u)\} \\
&= \frac{1}{N-k}E(u'u-u'X(X'X)^{-1}X'u) \\
&= \frac{\sigma^2}{N-k}\left(\operatorname{tr}\alpha-\operatorname{tr}X'\alpha X(X'X)^{-1}\right).
\end{aligned} \tag{1.10}$$

For arbitrary $X$ this reduces to $\sigma^2$ only when $\alpha = I$.
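
A small numerical case shows how far $E(s^2)$ can depart from $\sigma^2$. The sketch below evaluates $E(s^2)/\sigma^2 = (\operatorname{tr}\alpha - \operatorname{tr}X'\alpha X(X'X)^{-1})/(N-k)$ for the special case of a single regression vector ($k = 1$), where the second trace reduces to $x'\alpha x / x'x$. The choice of $\alpha_{ij} = 0.5^{|i-j|}$, the constant regressor and the function name are assumptions made for the example.

```python
def expected_s2_ratio(alpha, x):
    """E(s^2) / sigma^2 for least squares on one regression vector x (k = 1)."""
    n = len(x)
    trace = sum(alpha[i][i] for i in range(n))
    quadratic = sum(x[i] * alpha[i][j] * x[j] for i in range(n) for j in range(n))
    squared_length = sum(v * v for v in x)
    return (trace - quadratic / squared_length) / (n - 1)


n = 5
alpha = [[0.5 ** abs(i - j) for j in range(n)] for i in range(n)]
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
ones = [1.0] * n

print(expected_s2_ratio(alpha, ones))     # below 1: s^2 underestimates sigma^2
print(expected_s2_ratio(identity, ones))  # exactly 1 when alpha = I
```

With positively correlated errors and a constant regressor the ratio is about 0.69, so the usual variance estimate is too small in this instance.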


† By the 'least-squares procedures' is meant not only the estimation of the regression coefficients by least squares but also the use of the variance estimates and the t- and F-tests which are appropriate to the case where the error covariance matrix really is $I\sigma^2$. We might have introduced the name 'I-procedure' to cover this. Thus when the statistician decides the error covariance matrix is $\sigma^2\gamma$ he will use the '$\gamma$-procedure'.







2. BOUNDS ON THE BIAS OF THE VARIANCE ESTIMATES 
The bias in the estimate of the covariance matrix of the regression coefficient estimates may be found from (1.7), (1.8) and (1.10) and is given by

$$\sigma^2(X'X)^{-1}\left\{\frac{\operatorname{tr}\alpha-\operatorname{tr}X'\alpha X(X'X)^{-1}}{N-k}\,I - X'\alpha X(X'X)^{-1}\right\}. \tag{2.1}$$
In order to examine the magnitude of the effect on the variance of a single regression coefficient we now assume that the regression vectors form an orthonormal set, i.e.

$$X'X = I. \tag{2.2}$$

The condition (2.2) invoked to simplify (2.1) leads to a loss of generality; but, without it, the extreme value problem is very difficult. Fortunately we will not be so restricted when we come to discuss significance tests. On this assumption the estimate of $\beta_i$ will be

$$b_i = x_i'y, \tag{2.3}$$

and the bias in the estimate of its variance will be

$$\sigma^2\left\{\frac{\operatorname{tr}\alpha-\operatorname{tr}X'\alpha X}{N-k} - x_i'\alpha x_i\right\},$$

that is,

$$\sigma^2\left\{\frac{\operatorname{tr}\alpha-\sum_{j=1}^{k}x_j'\alpha x_j}{N-k} - x_i'\alpha x_i\right\}. \tag{2.4}$$


We now require the extremes of the expression (2.4). These will be seen to follow easily from the following algebraic result: the extremes of the quadratic form $Z'\alpha Z$, where $Z$ is a unit vector in a subspace spanned by a subset of the latent vectors of $\alpha$, are the least and greatest latent roots of $\alpha$ associated with the subspace. Thus the lower bound of (2.4) is obtained by choosing one of the x-vectors, say $x_i$, to be the latent vector of $\alpha$ corresponding to its largest latent root and the others to be the latent vectors corresponding to the next $k-1$ largest roots. The upper bound is found by a similar choice using the smallest roots of $\alpha$. More generally if $h$ $(< k)$ of the regression vectors are latent vectors of $\alpha$, the remaining regression vectors lie in an $N-h$ dimensional subspace. If $x_i$ lies in this subspace, the above results hold good if the choices are made with respect to the set of the $N-h$ latent roots associated with this subspace. If $x_i$ is a latent vector, a trivial modification is necessary. Thus for $X$ subject only to (2.2)

(a) Maximum bias = $\sigma^2$((mean of $N-k$ greatest roots of $\alpha$) - (least root of $\alpha$)),
(b) Minimum bias = $\sigma^2$((mean of $N-k$ least roots of $\alpha$) - (greatest root of $\alpha$)).      (2.5)

If the multiple $\sigma^2$ is omitted the results are in the nature of bounds on the fractional bias. It will be noticed that (2.5(a)) is positive and (2.5(b)) negative whatever $\alpha$ may be. It is commonly believed that presence of serial correlation makes the variance estimates deceptively small. The numerical results given in Part II show this to be the stronger tendency, but (2.5) shows that this need not always be the case. The deciding factor is seen to be the relationship of the regression vectors to the latent vectors of $\alpha$. Finally, (2.5) shows how the bias must tend to zero as $\alpha \to I$.
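
The two bounds depend on $\alpha$ only through its latent roots, so they are simple to evaluate once the roots are known. The following sketch takes a list of latent roots as given (the example values are hypothetical and are not derived from any particular error process) and returns the minimum and maximum bias as multiples of $\sigma^2$; the function name is illustrative.

```python
def bias_bounds(roots, k, sigma2=1.0):
    """Bounds on the bias of a variance estimate for orthonormal regressors.

    roots: the N latent roots of alpha; k: number of regression vectors.
    Returns (minimum bias, maximum bias).
    """
    ordered = sorted(roots)
    n = len(ordered)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < N")
    m = n - k
    maximum = sigma2 * (sum(ordered[-m:]) / m - ordered[0])
    minimum = sigma2 * (sum(ordered[:m]) / m - ordered[-1])
    return minimum, maximum


# Hypothetical latent roots with N = 5 and k = 2 regression vectors.
minimum, maximum = bias_bounds([0.2, 0.5, 1.0, 1.3, 2.0], k=2)
print(minimum, maximum)

# All roots equal to one (alpha = I): both bounds vanish.
print(bias_bounds([1.0] * 5, k=2))
```

The first call gives a negative lower bound and a positive upper bound, illustrating that the variance estimate may err in either direction depending on how the regression vectors sit relative to the latent vectors.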




3. LOWER BOUND TO THE EFFICIENCY OF THE ESTIMATES 


We turn now to a consideration of the efficiency of the estimates (1.6) of the regression coefficients. It is necessary to determine how this efficiency falls off as $\alpha$ differs more and more from $I$. To consider the efficiency of the estimation procedure as a whole it is convenient to introduce the generalized variance of the estimates. The generalized variance of the estimates $b_j$ of (1.6) is defined to be the determinant of their covariance matrix (1.7). Aitken (1948) has shown that the linear estimates of $\beta_j$ with the least generalized variance are those given by $(X'\alpha^{-1}X)^{-1}X'\alpha^{-1}y$; they have the covariance matrix $\sigma^2(X'\alpha^{-1}X)^{-1}$. The efficiency of the estimates (1.6) is then naturally defined as the ratio of the determinants of the covariance matrices $\sigma^2(X'\alpha^{-1}X)^{-1}$ and (1.7). Thus

$$\text{Eff.}(b) = \frac{|X'X|^2}{|X'\alpha X|\,|X'\alpha^{-1}X|}. \tag{3.1}$$

It should be noted that Eff.(b) is invariant when $X$ is transformed to $P$ by $P = XG$. It is always possible to find $G$ so that $P'P = I$. Thus it may be assumed without loss of generality that $X$ satisfies (2.2).

If all the regression vectors are latent vectors† of $\alpha$, Eff.(b) = 1. This may be verified immediately since only the diagonal terms in $X'X$, $X'\alpha X$ and $X'\alpha^{-1}X$ are then non-zero. The result finds an application in the work of R. L. & T. W. Anderson (1950) on testing for circular serial correlation when a short Fourier series has been fitted by least squares. In this problem the regression vectors are the latent vectors of the quadratic form occurring in the joint density function of successive observations from a circular autoregressive process so that the estimates of the regression coefficients are optimal whether serial correlation, of the type envisaged, is present or not.‡

The expression (3.1) tends of course to unity as $\alpha \to I$. The important question here is how small Eff.(b) may become, where $\alpha$ is fixed and not equal to $I$ but when $X$ is unrestricted. This lower bound to the efficiency may be obtained by generalizing an inequality due§ to J. W. S. Cassels. Cassels's inequality states that for fixed $a_j > 0$, $b_j > 0$ and all $w_j \ge 0$ $(j = 1, \ldots, N)$

$$\frac{4rR}{(r+R)^2} \le \frac{\left(\sum a_jb_jw_j\right)^2}{\left(\sum a_j^2w_j\right)\left(\sum b_j^2w_j\right)} \le 1, \tag{3.2}$$

where $r = \min_j a_j/b_j$, $R = \max_j a_j/b_j$. If the numbering of the $a_j$ and $b_j$ is so arranged that $r = a_1/b_1$ and $R = a_N/b_N$, then the lower bound is attained for $w_1 = 1/a_1b_1$ and $w_N = 1/a_Nb_N$, $w_j = 0$ $(j \neq 1, N)$. The lower bound in (3.2) is the square of the ratio of the geometric to the arithmetic mean of $r$ and $R$.
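
Cassels's inequality is easily illustrated numerically. The sketch below evaluates the middle term of the inequality for given positive $a_j$, $b_j$ and non-negative weights $w_j$, compares it with the lower bound $4rR/(r+R)^2$, and confirms that the bound is attained at the weights $w_1 = 1/a_1b_1$, $w_N = 1/a_Nb_N$ with all other weights zero. The numerical values and function names are hypothetical.

```python
import random


def cassels_ratio(a, b, w):
    """(sum a b w)^2 / ((sum a^2 w)(sum b^2 w))."""
    cross = sum(ai * bi * wi for ai, bi, wi in zip(a, b, w))
    first = sum(ai * ai * wi for ai, wi in zip(a, w))
    second = sum(bi * bi * wi for bi, wi in zip(b, w))
    return cross**2 / (first * second)


def cassels_lower_bound(a, b):
    """4 r R / (r + R)^2 with r and R the extreme values of a_j / b_j."""
    ratios = [ai / bi for ai, bi in zip(a, b)]
    r, big_r = min(ratios), max(ratios)
    return 4 * r * big_r / (r + big_r) ** 2


# Arranged so that a_1/b_1 is the smallest ratio and a_N/b_N the largest.
a = [1.0, 2.0, 3.0]
b = [3.0, 2.0, 1.0]
bound = cassels_lower_bound(a, b)
attaining = [1 / (a[0] * b[0]), 0.0, 1 / (a[-1] * b[-1])]
print(bound, cassels_ratio(a, b, attaining))

rng = random.Random(0)
for _ in range(5):
    w = [rng.random() for _ in a]
    print(cassels_ratio(a, b, w))  # always between the bound and 1
```

The upper limit of unity is the Cauchy-Schwarz inequality; the lower limit is the part that does the work in bounding the efficiency.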

To obtain the minimum of Eff.(b) by varying $X$ subject to the restriction that the regression vectors are orthogonal, we first eliminate from the problem any regression

† In contrast to this, the bias matrix (2.1) is not zero when the regression vectors are latent vectors of $\alpha$.
‡ This result is, however, little more than a mathematical curiosity. For apart from the fact that the Andersons' test has power against other alternatives, such estimates while optimal cannot be validly tested in the usual way.
§ A proof of this inequality is given in the Appendix. The inequality and the proof were communicated privately by Dr Cassels to the author, who wishes to record his gratitude.







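vectors which are latent vectors of $\alpha$. Suppose that $x_{k-h+1}, \ldots, x_k$ are latent vectors of $\alpha$ associated with the latent roots $\alpha_{N-h+1}, \ldots, \alpha_N$. If the remaining $x_j$ have co-ordinates $z_{jt}$ $(t = 1, \ldots, N-h)$ when referred to the latent vectors of $\alpha$ as axes

$$\text{Eff.}(b) = \frac{\prod_{j=1}^{k-h}\left(\sum_{t=1}^{N-h}z_{jt}^2\right)^2}{\left|\sum_{t=1}^{N-h}\alpha_tz_{it}z_{jt}\right|\,\left|\sum_{t=1}^{N-h}\dfrac{z_{it}z_{jt}}{\alpha_t}\right|}. \tag{3.3}$$

The matrices $X'\alpha X$ and $X'\alpha^{-1}X$ of (3.1), and therefore the matrices in the denominator of (3.3), are positive-definite by their form. By a theorem of Hadamard (see Hardy, Littlewood & Polya, 1934, p. 34) if $c_{ij}$ are the elements of a positive-definite matrix

$$|c_{ij}| \le c_{11}c_{22}\cdots c_{nn}. \tag{3.4}$$

Thus

$$\text{Eff.}(b) \ge \prod_{j=1}^{k-h}\frac{\left(\sum_{t=1}^{N-h}z_{jt}^2\right)^2}{\left(\sum_{t=1}^{N-h}\alpha_tz_{jt}^2\right)\left(\sum_{t=1}^{N-h}\dfrac{z_{jt}^2}{\alpha_t}\right)}.$$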

Without loss of generality we may assume that a, <a,<...<ay_,. We now choose the Z 
vectors to minimize the right-hand side of (3-4) applying Cassels’s inequality successively. 
When this is done, the matrix X takes the form 





\[
X = \left[\frac{a_1 + a_{N-h}}{\sqrt2},\ \frac{a_2 + a_{N-h-1}}{\sqrt2},\ \dots,\ \frac{a_{k-h} + a_{N-k+1}}{\sqrt2},\ a_{N-h+1},\ \dots,\ a_N\right].
\]


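Using (3·2) with $a_j = \sqrt{\alpha_j}$, $a_jb_j = 1$ and $w_j = z_{ji}^2$, we have the result†

\[
\frac{4\alpha_1\alpha_{N-h}}{(\alpha_1+\alpha_{N-h})^2}\cdot\frac{4\alpha_2\alpha_{N-h-1}}{(\alpha_2+\alpha_{N-h-1})^2}\cdots\frac{4\alpha_{k-h}\alpha_{N-k+1}}{(\alpha_{k-h}+\alpha_{N-k+1})^2} \le \text{Eff.}(b) \le 1. \tag{3·5}
\]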


Thus in spite of the use of Hadamard's inequality, we have arrived at a lower bound for Eff.(b) which is attainable. This is so because we were led to a choice of $X$ for which $X'X$, $X'\alpha X$ and $X'\alpha^{-1}X$ are diagonal. The general rule for the formation of the lower bound to the efficiency is evident. From the $N-h$ roots of $\alpha$ not associated with any regression vectors, we choose successively the $(k-h)$ most extreme pairs. The lower bound to the efficiency thus depends on the ratios of the extreme roots of $\alpha$. When the roots of $\alpha$ are all approximately equal (approximately equal to unity if the error process is stationary) the efficiency must be high. It will be shown in Part II that high serial correlation of the autoregressive and moving average types leads to $\alpha$ having roots of very different magnitudes and therefore to the possibility of low efficiencies. Finally, it will be noted that the maximum and minimum efficiencies are taken when the arbitrary regression vectors are respectively latent vectors of $\alpha$ and sums of certain pairs of latent vectors of $\alpha$; using the invariance properties of (3·1) these statements serve to define the maximizing and minimizing regression spaces.
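The rule for forming the lower bound (3·5) can be sketched in a few lines. The function below is an illustration only: its name and arguments are assumptions, and it takes as input the $N-h$ latent roots not associated with any regression vector together with the number $k-h$ of arbitrary regression vectors.

```python
def efficiency_lower_bound(free_roots, n_arbitrary):
    """Lower bound (3.5): product of 4*lo*hi/(lo+hi)^2 over extreme pairs.

    free_roots  : the N-h positive latent roots left in the problem
    n_arbitrary : k-h, the number of arbitrary regression vectors
    """
    roots = sorted(free_roots)
    if 2 * n_arbitrary > len(roots):
        raise ValueError("need at least 2*(k-h) free roots to form the pairs")
    bound = 1.0
    for i in range(n_arbitrary):
        lo, hi = roots[i], roots[-1 - i]  # i-th most extreme pair
        bound *= 4.0 * lo * hi / (lo + hi) ** 2
    return bound


# Equal roots give efficiency 1; widely spread roots lower the bound.
equal_case = efficiency_lower_bound([1.0, 1.0, 1.0, 1.0], 2)
one_pair = efficiency_lower_bound([1.0, 2.0, 3.0, 4.0], 1)
two_pairs = efficiency_lower_bound([1.0, 2.0, 3.0, 4.0], 2)
```

With roots 1, 2, 3, 4 the first pair (1, 4) contributes 16/25 and the second pair (2, 3) contributes 24/25, so the bound falls as more arbitrary vectors are admitted.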


4. BOUNDS ON THE SIGNIFICANCE POINTS OF THE t- AND F-TESTS


The significance tests usually made in regression analysis relate to linear hypotheses about the regression coefficients. Thus in the model (1·1), these hypotheses will be special cases of a null hypothesis specifying the value of one or more linear functions of $\beta_1,\dots,\beta_k$. When the errors are independently distributed with the same normal distribution, it is well known that the optimal test statistics for such hypotheses have the t- or F-distributions. In this section bounds will be obtained for the distributions of these statistics when the covariance matrix of the errors is $\sigma^2\alpha$; these bounding distributions converge to the tabulated distributions as $\alpha \to I$. We begin by showing that there is no loss of generality in considering only hypotheses specifying the value of several regression coefficients and in assuming that the regression vectors form an orthonormal set. The bounding distributions for this reduced problem will then be derived. The section concludes with a short discussion of the numerical determination of the bounding significance points.

† This form of the lower bound was conjectured by J. Durbin.

We consider again the model (1·1), $y = X\beta + u$, where the errors $u$ have a multivariate-normal distribution with zero mean vector and covariance matrix $\sigma^2\alpha$ and where the regression vectors are linearly independent. We suppose that the null hypothesis to be tested is

\[
f_i'\beta = \phi_i \quad (i = 1,\dots,h \le k), \tag{4·1}
\]

where $f_i$, $\phi_i$ $(i = 1,\dots,h)$ are given and $f_i$ $(i = 1,\dots,h)$ are linearly independent. The optimal test of the hypothesis (4·1) when $\alpha = I$ is obtained by building up a quadratic form in the variables $f_i'b - \phi_i$ ($b$ is the least-squares estimate of $\beta$ defined in (1·6)) which is distributed as a multiple of $\chi^2$ with $h$ degrees of freedom. The form must therefore be based on the inverse of the covariance matrix of the variables $f_i'b - \phi_i$. Because $\sigma^2$ is unknown it is necessary to divide this form by the estimate $s^2$ of $\sigma^2$ given in (1·9) in order to obtain a test statistic. The ratio so obtained has the F-distribution with $h$ and $N-k$ degrees of freedom provided the null hypothesis is true. This is the statistic we must examine when $\alpha$ is not necessarily equal to $I$.
If $X$ is transformed to $Z$ by the transformation $P$,

\[
Z = XP, \tag{4·2}
\]

it is always possible to find a non-singular $P$ so that $Z'Z = I$. For $P$ has only to satisfy

\[
P'X'XP = I. \tag{4·3}
\]

Defining $\delta = P^{-1}\beta$, where $P$ satisfies (4·3), (1·1) becomes

\[
y = Z\delta + u. \tag{4·4}
\]

The least-squares estimates of $\beta$ and $\delta$, $b$ and $d$ say, are given by $b = (X'X)^{-1}X'y$, $d = Z'y$. It is easily seen that

\[
d = P^{-1}b. \tag{4·5}
\]

The denominator of the test statistic is $s^2$, and it has been seen in the derivation of (1·10) that $s^2$ is proportional to

\[
u'u - u'X(X'X)^{-1}X'u, \tag{4·6}
\]

which becomes, under the transformation (4·2) and (4·3),

\[
u'u - u'Z(Z'Z)^{-1}Z'u = u'u - u'ZZ'u. \tag{4·7}
\]

Thus (4·6) is invariant under transformations (4·2) and (4·7) is invariant under transformations (4·2) with $PP' = I$. On the null hypothesis, the variables in the numerator are given by

\[
f_i'b - \phi_i = f_i'(X'X)^{-1}X'u \quad (i = 1,\dots,h). \tag{4·8}
\]







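Thus for $i = 1,\dots,h$, $f_i'b - \phi_i$ are distributed as $f_i'b$ when $\beta = 0$ or as $f_i'Pd$ when $\delta = 0$. We may express the numerator quadratic form in terms of $d$ without first stating it in terms of $b$. Writing $f_i'P = g_i'$, it must on the null hypothesis have the form

\[
[g_1'd,\dots,g_h'd]
\begin{bmatrix} g_1'g_1 & \cdots & g_1'g_h \\ \vdots & & \vdots \\ g_h'g_1 & \cdots & g_h'g_h \end{bmatrix}^{-1}
\begin{bmatrix} g_1'd \\ \vdots \\ g_h'd \end{bmatrix}, \tag{4·9}
\]

because this is seen to be correct when $\alpha = I$. (4·9) may be written as

\[
d'[g_1,\dots,g_h]
\begin{bmatrix} g_1'g_1 & \cdots & g_1'g_h \\ \vdots & & \vdots \\ g_h'g_1 & \cdots & g_h'g_h \end{bmatrix}^{-1}
\begin{bmatrix} g_1' \\ \vdots \\ g_h' \end{bmatrix} d. \tag{4·10}
\]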


The matrix of this quadratic form in $d$ is symmetric and has $h$ unit roots and $k-h$ zero roots. Thus an orthogonal transformation $H$ may be found so that if $d = Hc$, (4·10) reduces to

\[
c'Jc, \quad\text{where}\quad J = \begin{bmatrix} I_h & 0 \\ 0 & 0 \end{bmatrix},
\]

a $k \times k$ matrix with units in the first $h$ places on the leading diagonal and zeros elsewhere. Since $H$ is orthogonal,

\[
c = H^{-1}d = H'Z'u. \tag{4·11}
\]

If finally we put

\[
W = ZH, \tag{4·12}
\]

then $c = W'u$, and the test statistic is proportional to

\[
\frac{u'WJW'u}{u'u - u'WW'u} = \frac{(u'w_1)^2 + \dots + (u'w_h)^2}{u'u - (u'w_1)^2 - \dots - (u'w_k)^2}, \tag{4·13}
\]


where $w_1,\dots,w_k$ form an orthonormal set of vectors. This is the desired canonical form for the test statistic because it is the form which arises in testing the hypothesis

\[
\psi_1 = \psi_2 = \dots = \psi_h = 0
\]

in the model

\[
y = W\psi + u,
\]

where $W'W = I$ and $\psi = \{\psi_1,\dots,\psi_k\}$. $c$ defined above is the least-squares estimate of $\psi$.
Since the proof above for the general linear hypothesis presents some difficulty, it seems worth while to establish the result for a simple particular case. Suppose that the null hypothesis is

\[
\beta = \beta_0. \tag{4·14}
\]

To obtain the form of the test statistic, we refer to (1·8), which defines the covariance matrix of $b - \beta$, and (1·9), which gives an expression for $s^2$. Thus the statistic is proportional to

\[
\frac{(b-\beta_0)'(X'X)(b-\beta_0)}{(y - Xb)'(y - Xb)}. \tag{4·15}
\]

Since $b = (X'X)^{-1}X'y$, the statistic (4·15) on the null hypothesis may be put in the form

\[
\frac{u'X(X'X)^{-1}X'u}{u'u - u'X(X'X)^{-1}X'u}. \tag{4·16}
\]

If now $X$ is transformed to $Z$, so that $Z'Z = I$, the discussion of (4·2) ... (4·7) shows that (4·16) is reduced to

\[
\frac{u'ZZ'u}{u'u - u'ZZ'u}. \tag{4·17}
\]

This is the required canonical form for the hypothesis (4·14).
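The reduction from (4·16) to (4·17) can be checked numerically. The sketch below assumes NumPy is available and builds one admissible $P$ from the Cholesky factor of $X'X$; that particular construction, the dimensions and the random data are illustrative choices, since the text only requires $P'X'XP = I$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k = 12, 3
X = rng.standard_normal((N, k))   # arbitrary full-rank regression matrix
u = rng.standard_normal(N)        # one realization of the error vector

# One choice of P with P'X'XP = I: if X'X = L L', take P = (L^{-1})'.
L = np.linalg.cholesky(X.T @ X)
P = np.linalg.inv(L).T
Z = X @ P                          # transformation (4.2)

# Statistic (4.16), written with X.
fitted_x = u @ (X @ np.linalg.solve(X.T @ X, X.T @ u))
stat_x = fitted_x / (u @ u - fitted_x)

# Canonical form (4.17), written with Z.
zu = Z.T @ u
stat_z = (zu @ zu) / (u @ u - zu @ zu)
```

The two statistics agree to rounding error because $X$ and $Z$ span the same regression space.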













We consider the model of (1·1) and (1·2) under the assumptions of orthonormality (2·2). The errors $u$ will now be assumed to have a multivariate normal distribution with zero mean vector and covariance matrix $\sigma^2\alpha$. We suppose initially that the first $k-1$ of the regression vectors $x$ are latent vectors of $\alpha$, the remaining vector being arbitrary. Denote $x_i = a_i$, where $\alpha a_i = \alpha_i a_i$ $(i = 1,\dots,k-1)$. To test the hypothesis

\[
\beta_k = \beta_k^0 \tag{4·18}
\]

the appropriate statistic is

\[
t^2 = \frac{(N-k)(b_k - \beta_k^0)^2}{\sum_{t=1}^{N}\left(y_t - \sum_{i=1}^{k} b_i x_{ti}\right)^2}. \tag{4·19}
\]

On the null hypothesis $t^2/(N-k)$ has the matrix form

\[
\frac{t^2}{N-k} = \frac{u'x_kx_k'u}{u'(I - a_1a_1' - \dots - a_{k-1}a_{k-1}' - x_kx_k')u} = T, \text{ say}. \tag{4·20}
\]





Bounds on the significance points of $T$ will now be found which enclose the true significance point of $T$, no matter what the direction of the vector $x_k$. By a device often used in the treatment of ratios, we have

\[
P(T<\tau) = P\{u'(x_kx_k' - \tau(I - a_1a_1' - \dots - a_{k-1}a_{k-1}' - x_kx_k'))u < 0\}. \tag{4·21}
\]

But the quadratic form on the right-hand side of (4·21) is distributed like $\sum_{i=1}^{N}\nu_i\eta_i^2$, where the $\nu_i$ are the roots of the determinantal equation

\[
|x_kx_k' - \tau(I - a_1a_1' - \dots - a_{k-1}a_{k-1}' - x_kx_k') - \nu\alpha^{-1}| = 0, \tag{4·22}
\]

and the $\eta_i$ are N.I.D. (0, 1). For fixed $\tau$, let $x_k$ vary. Then the $\nu_i$ and $P(\sum\nu_i\eta_i^2 < 0)$ vary. We will show that there exists a direction of $x_k$ such that all the $\nu_i$ take maximum values. For this $x_k$, $P(\sum\nu_i\eta_i^2 < 0)$ is clearly a minimum. Since, for fixed $x_k$, $P(T<\tau)$ is a monotonically increasing function of $\tau$, it follows that the significance point of $T$, $\tau_U$ say, for this special maximizing direction is greater than the significance point at the same level for any other vector $x_k$. In the same way a lower bound $\tau_L$ will be established.


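For this purpose apply to (4·22) the orthogonal transformation $K$ which diagonalizes $\alpha$. If

\[
K'x_k = m, \qquad K'\alpha^{-1}K = \begin{bmatrix} 1/\alpha_1 & & & \\ & 1/\alpha_2 & & \\ & & \ddots & \\ & & & 1/\alpha_N \end{bmatrix}, \tag{4·23}
\]

then the first $k-1$ co-ordinates of $m$ are zero because $x_k$ is required to be orthogonal to the $k-1$ latent vectors corresponding to $\alpha_1,\dots,\alpha_{k-1}$, and

\[
K'(I - a_1a_1' - \dots - a_{k-1}a_{k-1}')K = \begin{bmatrix} 0 & 0 \\ 0 & I_{N-k+1} \end{bmatrix}, \tag{4·24}
\]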
















where the matrix on the right-hand side of (4·24) is an $N \times N$ identity matrix with the first $k-1$ diagonal elements replaced by zeros. Thus equation (4·22) has $k-1$ zero roots and $N-k+1$ which satisfy the reduced determinantal equation

\[
\begin{vmatrix}
(1+\tau)m_k^2 - \tau - \dfrac{\nu}{\alpha_k} & (1+\tau)m_km_{k+1} & \cdots & (1+\tau)m_km_N \\
(1+\tau)m_{k+1}m_k & (1+\tau)m_{k+1}^2 - \tau - \dfrac{\nu}{\alpha_{k+1}} & \cdots & (1+\tau)m_{k+1}m_N \\
\vdots & \vdots & & \vdots \\
(1+\tau)m_Nm_k & (1+\tau)m_Nm_{k+1} & \cdots & (1+\tau)m_N^2 - \tau - \dfrac{\nu}{\alpha_N}
\end{vmatrix} = 0. \tag{4·25}
\]

Subtracting $m_{k+1}/m_k$ times the first from the second row, etc., and expanding the result as a bordered determinant we find the equation in $\nu$,

\[
\prod_{p=k}^{N}\left(\tau + \frac{\nu}{\alpha_p}\right) - (1+\tau)\sum_{p=k}^{N} m_p^2 \prod_{\substack{j=k \\ j\ne p}}^{N}\left(\tau + \frac{\nu}{\alpha_j}\right) = 0, \tag{4·26}
\]

i.e.

\[
\sum_{p=k}^{N}\frac{m_p^2}{\tau + \nu/\alpha_p} = \frac{1}{1+\tau}. \tag{4·27}
\]

We see that the effect of the regression variables chosen as latent vectors has been the same as on previous occasions, i.e. the corresponding latent roots are eliminated from the problem. We may therefore without loss of generality suppose that

\[
\alpha_k < \alpha_{k+1} < \dots < \alpha_N. \tag{4·28}
\]


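The location of the roots $\nu_i$ is easily seen graphically from (4·27). We may, however, proceed as follows. Define the continuous function of $\nu$,

\[
f(\nu) = \prod_{p=k}^{N}\left(\tau + \frac{\nu}{\alpha_p}\right) - (1+\tau)\sum_{p=k}^{N} m_p^2 \prod_{\substack{j=k \\ j\ne p}}^{N}\left(\tau + \frac{\nu}{\alpha_j}\right).
\]

Then since

\[
f(-\tau\alpha_i) = -(1+\tau)\,m_i^2\,\tau^{N-k}\prod_{\substack{j=k \\ j\ne i}}^{N}\left(1 - \frac{\alpha_i}{\alpha_j}\right)
\]

is seen to have the opposite sign to $f(-\tau\alpha_{i+1})$, the equation (4·26) must have at least one root in each of the intervals

\[
(-\tau\alpha_{i+1},\ -\tau\alpha_i).
\]

But, considering equation (4·27), we see that for a positive root $\nu$

\[
\frac{1}{\tau + \nu/\alpha_k} \le \sum_{i=k}^{N}\frac{m_i^2}{\tau + \nu/\alpha_i} = \frac{1}{1+\tau} \le \frac{1}{\tau + \nu/\alpha_N}
\]

by (4·28) and the fact that $\sum_{i=k}^{N} m_i^2 = 1$. Hence there is at least one root of (4·26) and (4·27) in the interval

\[
(\alpha_k,\ \alpha_N).
\]

Combining the results, we have proved that there is one root $\nu_i$ in each of the $N-k+1$ intervals

\[
(-\tau\alpha_{i+1},\ -\tau\alpha_i)\quad (i = k,\dots,N-1), \qquad (\alpha_k,\ \alpha_N). \tag{4·29}
\]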
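The interval result can be illustrated by locating the roots of the polynomial form (4·26) numerically. The sketch below uses plain bisection on each interval of (4·29); the function names, the value of $\tau$, the latent roots and the direction $m$ are all hypothetical choices made for the demonstration.

```python
def secular_poly(nu, tau, alphas, m):
    """Left-hand side of (4.26) for roots alphas and unit vector m."""
    terms = [tau + nu / a for a in alphas]
    full = 1.0
    for t in terms:
        full *= t
    weighted = 0.0
    for p, mp in enumerate(m):
        prod = 1.0
        for j, t in enumerate(terms):
            if j != p:
                prod *= t
        weighted += mp * mp * prod
    return full - (1.0 + tau) * weighted


def bisect(f, lo, hi, iters=200):
    """Bisection, assuming f changes sign between lo and hi."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid == 0.0:
            return mid
        if (f_lo < 0) == (f_mid < 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def roots_by_interval(tau, alphas, m):
    """One root per interval of (4.29); alphas must be strictly increasing."""
    f = lambda nu: secular_poly(nu, tau, alphas, m)
    intervals = [
        (-tau * alphas[i + 1], -tau * alphas[i]) for i in range(len(alphas) - 1)
    ]
    intervals.append((alphas[0], alphas[-1]))
    return [(lo, hi, bisect(f, lo, hi)) for lo, hi in intervals]


# Hypothetical example: four remaining latent roots, equal weights m_p^2 = 1/4.
tau = 0.3
alphas = [0.5, 1.0, 2.0, 4.0]
m = [0.5, 0.5, 0.5, 0.5]
located = roots_by_interval(tau, alphas, m)
```

Each tuple in `located` holds an interval and the root found inside it, giving $N-k+1$ roots in all.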











Finally, if $x_k$ is chosen to be the latent vector associated with the root $\alpha_k$, i.e. $a_k$, (4·26) gives the roots

\[
-\tau\alpha_N,\ \dots,\ -\tau\alpha_{k+1} \quad\text{and}\quad \alpha_k, \tag{4·30}
\]

which are seen from (4·29) to be the least possible set of roots. If $x_k$ is chosen to be the latent vector associated with $\alpha_N$, i.e. $a_N$, (4·26) gives the greatest possible set of roots

\[
-\tau\alpha_{N-1},\ \dots,\ -\tau\alpha_k \quad\text{and}\quad \alpha_N. \tag{4·31}
\]


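We have therefore proved

THEOREM 1. If $\tau$ is such that $P(T<\tau) = C$ $(0<C<1)$, $T$ being defined by (4·20), then there exist numbers $\tau_L$, $\tau_U$ such that

\[
\tau_L < \tau < \tau_U \tag{4·32}
\]

for all vectors $x_k$. $\tau_L$ and $\tau_U$ are determined by the equations

\[
P\left(\frac{\alpha_k\eta_k^2}{\alpha_{k+1}\eta_{k+1}^2 + \dots + \alpha_N\eta_N^2} < \tau_L\right) = C, \qquad
P\left(\frac{\alpha_N\eta_N^2}{\alpha_k\eta_k^2 + \dots + \alpha_{N-1}\eta_{N-1}^2} < \tau_U\right) = C, \tag{4·33}
\]

where $\eta_k,\dots,\eta_N$ are N.I.D. (0, 1).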
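The bounding points of Theorem 1 have no closed form in general, but they can be approximated by simulation. The following is a rough Monte Carlo sketch, not the method used in the paper: the function name, the number of draws, the seed and the example roots are assumptions made for illustration.

```python
import random


def bounding_points(alphas, level, n_draws=20000, seed=1):
    """Monte Carlo estimates of (tau_L, tau_U) in (4.33).

    alphas : the latent roots alpha_k, ..., alpha_N in ascending order
    level  : the probability C at which the points are required
    """
    rng = random.Random(seed)
    lower, upper = [], []
    for _ in range(n_draws):
        weighted = [a * rng.gauss(0.0, 1.0) ** 2 for a in alphas]
        # Smallest root in the numerator gives the lower bounding variable.
        lower.append(weighted[0] / sum(weighted[1:]))
        # Largest root in the numerator gives the upper bounding variable.
        upper.append(weighted[-1] / sum(weighted[:-1]))
    lower.sort()
    upper.sort()
    idx = min(n_draws - 1, int(level * n_draws))
    return lower[idx], upper[idx]


# Hypothetical roots with a wide spread, 95 per cent point.
tau_l, tau_u = bounding_points([0.2, 1.0, 1.0, 1.0, 5.0], 0.95)
```

The wider the spread of the latent roots, the further apart the two estimated points lie; with equal roots both variables have the same distribution.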


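We now extend this theorem, with the same assumptions about the model, to a joint test of the null hypothesis

\[
\beta_1 = \beta_1^0,\ \dots,\ \beta_h = \beta_h^0 \quad (h<k-1), \qquad \beta_k = \beta_k^0. \tag{4·34}
\]

The statistic $T$ for this hypothesis is on the null hypothesis

\[
T = \frac{u'a_1a_1'u + \dots + u'a_ha_h'u + u'x_kx_k'u}{u'(I - a_1a_1' - \dots - a_{k-1}a_{k-1}' - x_kx_k')u}. \tag{4·35}
\]

Proceeding as before the determinantal equation is found to be

\[
|a_1a_1' + \dots + a_ha_h' + x_kx_k' - \tau(I - a_1a_1' - \dots - a_{k-1}a_{k-1}' - x_kx_k') - \nu\alpha^{-1}| = 0. \tag{4·36}
\]

After applying the orthogonal transformation $K$ of (4·23) the roots $\nu_i$ are seen to be $\alpha_1, \alpha_2, \dots, \alpha_h$, $k-h-1$ zeros, and the $N-k+1$ roots of (4·25). This leads immediately to the following extension of Theorem 1.

THEOREM 2. If $\tau$ is such that $P(T<\tau) = C$ $(0<C<1)$, $T$ being defined by (4·35), then there exist numbers $\tau_L$ and $\tau_U$ such that

\[
\tau_L \le \tau \le \tau_U
\]

for all vectors $x_k$. $\tau_L$ and $\tau_U$ are determined by the equations

\[
P\left(\frac{\alpha_1\eta_1^2 + \dots + \alpha_h\eta_h^2 + \alpha_k\eta_k^2}{\alpha_{k+1}\eta_{k+1}^2 + \dots + \alpha_N\eta_N^2} < \tau_L\right) = C, \qquad
P\left(\frac{\alpha_1\eta_1^2 + \dots + \alpha_h\eta_h^2 + \alpha_N\eta_N^2}{\alpha_k\eta_k^2 + \dots + \alpha_{N-1}\eta_{N-1}^2} < \tau_U\right) = C, \tag{4·37}
\]

where $\eta_1,\dots,\eta_h,\eta_k,\dots,\eta_N$ are N.I.D. (0, 1).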








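To complete this case, we examine the test of the null hypothesis

\[
\beta_1 = \beta_1^0,\ \dots,\ \beta_h = \beta_h^0 \quad (h<k-1). \tag{4·38}
\]

The $T$ statistic is here defined on the null hypothesis by

\[
T = \frac{u'a_1a_1'u + \dots + u'a_ha_h'u}{u'(I - a_1a_1' - \dots - a_{k-1}a_{k-1}' - x_kx_k')u}. \tag{4·39}
\]

The associated determinantal equation is

\[
|a_1a_1' + \dots + a_ha_h' - \tau(I - a_1a_1' - \dots - a_{k-1}a_{k-1}' - x_kx_k') - \nu\alpha^{-1}| = 0. \tag{4·40}
\]

The roots of (4·40) are $\alpha_1,\dots,\alpha_h$, $k-h-1$ zeros and the roots of the determinantal equation

\[
\begin{vmatrix}
\tau m_k^2 - \tau - \dfrac{\nu}{\alpha_k} & \tau m_km_{k+1} & \cdots & \tau m_km_N \\
\tau m_{k+1}m_k & \tau m_{k+1}^2 - \tau - \dfrac{\nu}{\alpha_{k+1}} & \cdots & \tau m_{k+1}m_N \\
\vdots & \vdots & & \vdots \\
\tau m_Nm_k & \tau m_Nm_{k+1} & \cdots & \tau m_N^2 - \tau - \dfrac{\nu}{\alpha_N}
\end{vmatrix} = 0. \tag{4·41}
\]

Expanding as before, we find the equation in $\nu$

\[
\prod_{i=k}^{N}\left(\tau + \frac{\nu}{\alpha_i}\right) - \tau\sum_{i=k}^{N} m_i^2 \prod_{\substack{j=k \\ j\ne i}}^{N}\left(\tau + \frac{\nu}{\alpha_j}\right) = 0, \tag{4·42}
\]

i.e.

\[
\frac{1}{\tau} = \sum_{i=k}^{N}\frac{m_i^2}{\tau + \nu/\alpha_i}. \tag{4·43}
\]

As with equation (4·26), it can be shown that equation (4·42) has at least one root in each of the intervals

\[
(-\tau\alpha_{i+1},\ -\tau\alpha_i).
\]

By inspection we see that it always has a zero root so that there is just one in each of these intervals. If $x_k$ is chosen to be the latent vector of $\alpha$ associated with the root $\alpha_k$, i.e. $a_k$, the least set of roots for $\nu_i$ is obtained, while if $x_k$ is the latent vector of $\alpha$ associated with the root $\alpha_N$, i.e. $a_N$, the greatest set is found. We may thus state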


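THEOREM 3. If $\tau$ is such that $P(T<\tau) = C$ $(0<C<1)$, $T$ being defined by (4·39), then there exist numbers $\tau_L$ and $\tau_U$ such that

\[
\tau_L \le \tau \le \tau_U
\]

for all vectors $x_k$. $\tau_L$ and $\tau_U$ are determined by the equations

\[
P\left(\frac{\alpha_1\eta_1^2 + \dots + \alpha_h\eta_h^2}{\alpha_{k+1}\eta_{k+1}^2 + \dots + \alpha_N\eta_N^2} < \tau_L\right) = C, \qquad
P\left(\frac{\alpha_1\eta_1^2 + \dots + \alpha_h\eta_h^2}{\alpha_k\eta_k^2 + \dots + \alpha_{N-1}\eta_{N-1}^2} < \tau_U\right) = C, \tag{4·44}
\]

where $\eta_1,\dots,\eta_h,\eta_k,\dots,\eta_N$ are N.I.D. (0, 1).

We now proceed to a case where there is more than one arbitrary regression vector. Suppose that all $k$ regression vectors are arbitrary except for restriction (2·2) and that the hypothesis to be tested is

\[
\beta_i = \beta_i^0 \quad (i = 1,\dots,k). \tag{4·45}
\]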
















On the null hypothesis the usual F-ratio is distributed like a multiple of

\[
T = \frac{u'XX'u}{u'(I - XX')u}; \tag{4·46}
\]

therefore

\[
R = \frac{T}{1+T} = \frac{u'XX'u}{u'u}. \tag{4·47}
\]

Since $T$ is a monotonic function of $R$ $(T = R/(1-R))$, the extremes of $P(R<r)$ will yield extremes of $P(T<\tau)$. We therefore proceed to find distributions which bound that of $R$ above and below. If $H$ is the linear transformation such that

\[
H\alpha H' = I, \tag{4·48}
\]

then the variables $v = Hu$ are N.I.D. (0, 1), for $\sigma^2$ may be set equal to unity without loss of generality. Then

\[
R = \frac{v'H'^{-1}XX'H^{-1}v}{v'(HH')^{-1}v}.
\]

From (4·48),

\[
(HH')^{-1} \simeq \begin{bmatrix} \alpha_1 & & 0 \\ & \ddots & \\ 0 & & \alpha_N \end{bmatrix},
\]

so that we may write

\[
R \simeq \frac{v'H'^{-1}XX'H^{-1}v}{\sum_{i=1}^{N}\alpha_i v_i^2}. \tag{4·49}
\]

$R$ is a ratio of non-negative definite quadratic forms in N.I.D. (0, 1) variables. For variation of $X$, the upper and lower bounding random variables for $R$ are obtained when the numerator takes its least and greatest forms. Now the numerator of $R$ is distributed like

\[
\sum_{i=1}^{k}\nu_i\eta_i^2, \tag{4·50}
\]

where $\eta_1,\dots,\eta_k$ are N.I.D. (0, 1) and $\nu_1,\dots,\nu_k$ are the $k$ non-zero roots, written in ascending order of magnitude, of

\[
|H'^{-1}XX'H^{-1} - \nu I| = 0, \tag{4·51}
\]

i.e. the non-zero roots of

\[
|H^{-1}H'^{-1}XX' - \nu I| = 0,
\]

since the roots of a product are indifferent to the order of multiplication. But by (4·48)

\[
H^{-1}H'^{-1} = (H'H)^{-1} = \alpha,
\]

so that $\nu_1<\dots<\nu_k$ are the $k$ non-zero roots of

\[
|\alpha XX' - \nu I| = 0. \tag{4·52}
\]

Suppose that $z_1,\dots,z_{N-k}$ are vectors which together with $x_1,\dots,x_k$ form a complete orthonormal set in $N$-space. If $Z = [z_1,\dots,z_{N-k}]$, then

\[
[X, Z][X, Z]' = [X, Z]'[X, Z] = I.
\]














But $[X, Z][X, Z]' = XX' + ZZ'$, so that

\[
XX' = I - ZZ' = (I - z_1z_1')\cdots(I - z_{N-k}z_{N-k}') = M_1\cdots M_{N-k}, \text{ say},
\]

where $M_1,\dots,M_{N-k}$ are idempotent matrices of rank $N-1$.

With this characterization of $XX'$, inequalities on the roots of (4·52) may be obtained immediately from the investigation of Durbin & Watson (1950). They show, with a proof similar to that of Theorem 1, that the roots of $\alpha M_1$ lie between the roots of $\alpha$. Thus the non-zero roots of $\alpha M_1M_2$ lie between the non-zero roots of $\alpha M_1$. Combining the inequalities given by successive applications of this result, it follows that the roots $\nu_1,\dots,\nu_k$ of (4·52) must satisfy the inequalities

\[
\alpha_1 \le \nu_1 \le \alpha_{N-k+1}, \quad \alpha_2 \le \nu_2 \le \alpha_{N-k+2}, \quad \dots, \quad \alpha_k \le \nu_k \le \alpha_N. \tag{4·53}
\]

The least set of roots is obtained when the $k$ regression vectors are chosen to be the latent vectors $a_1,\dots,a_k$ (or linear combinations of these vectors). The greatest set of roots is obtained when the regression vectors are similarly related to the latent vectors $a_{N-k+1},\dots,a_N$. These choices lead to the extreme forms of $R$ required. We may thus state


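THEOREM 4. If $\tau$ is such that $P(T<\tau) = C$ $(0<C<1)$, $T$ being defined by (4·46), then there exist numbers $\tau_L$ and $\tau_U$ such that

\[
\tau_L \le \tau \le \tau_U
\]

for all $X$. $\tau_L$ and $\tau_U$ are defined by the equations

\[
P\left(\frac{\alpha_1\eta_1^2 + \dots + \alpha_k\eta_k^2}{\alpha_{k+1}\eta_{k+1}^2 + \dots + \alpha_N\eta_N^2} < \tau_L\right) = C, \qquad
P\left(\frac{\alpha_{N-k+1}\eta_{N-k+1}^2 + \dots + \alpha_N\eta_N^2}{\alpha_1\eta_1^2 + \dots + \alpha_{N-k}\eta_{N-k}^2} < \tau_U\right) = C, \tag{4·54}
\]

where $\eta_1,\dots,\eta_N$ are N.I.D. (0, 1).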
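The root inequalities (4·53) that underlie Theorem 4 can be checked numerically for a randomly chosen orthonormal $X$. The sketch assumes NumPy is available, takes $\alpha$ to be diagonal for convenience, and uses the fact that the non-zero roots of $\alpha XX'$ coincide with the roots of the $k \times k$ matrix $X'\alpha X$; the dimensions and random values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 8, 3

# Hypothetical latent roots of alpha, in ascending order.
alpha_roots = np.sort(rng.uniform(0.2, 5.0, size=N))
alpha = np.diag(alpha_roots)

# Random regression matrix with orthonormal columns (X'X = I).
X, _ = np.linalg.qr(rng.standard_normal((N, k)))

# Non-zero roots of alpha X X', computed as the roots of X' alpha X.
nu = np.linalg.eigvalsh(X.T @ alpha @ X)   # returned in ascending order

lower_ok = bool(np.all(alpha_roots[:k] <= nu + 1e-12))
upper_ok = bool(np.all(nu <= alpha_roots[N - k:] + 1e-12))
```

Both flags come out true: the $i$th root lies between $\alpha_i$ and $\alpha_{N-k+i}$, whatever orthonormal $X$ is drawn.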

The rules of formation of the random variables which lead to the bounding significance points are now evident from Theorems 1, 2, 3, 4, so that there is no need to give the details for every type of test which may arise. In all cases it is seen that the distributions of the bounding variables converge to the appropriate F-distribution as the $\alpha$'s approach equality.

The calculation of the bounding significance points discussed above presents some difficulty. We are faced with the determination of the significance points of the ratio of two independent weighted sums of squares of normal deviates. Since the weights are all positive, the method of Pitman & Robbins (1949) may be used when the $\alpha$'s are not too different. Some numerical results obtained by this method will be given in Part II. The amount of arithmetic required is always large (the electronic computer of the Cambridge Mathematical Laboratory was used in these calculations), so it is worth while examining approximate methods.

To illustrate the approximations used in Part II, let us consider the random variable which generates the lower bound in Theorem 4, namely,

\[
\frac{\alpha_1\eta_1^2 + \dots + \alpha_k\eta_k^2}{\alpha_{k+1}\eta_{k+1}^2 + \dots + \alpha_N\eta_N^2}.
\]

The simplest approximation to this ratio is

\[
\frac{\sum_{1}^{k}\alpha_i}{\sum_{k+1}^{N}\alpha_i}\cdot\frac{N-k}{k}\cdot\frac{\eta_1^2 + \dots + \eta_k^2}{\eta_{k+1}^2 + \dots + \eta_N^2} = \frac{\sum_{1}^{k}\alpha_i}{\sum_{k+1}^{N}\alpha_i}\,F_{k,\,N-k},
\]

where $F_{k,\,N-k}$ has the F-distribution with $k$ and $N-k$ degrees of freedom. If in a practical analysis we had some knowledge of the correlogram of the errors, this approach suggests that multiples of the usual significance points might be used to make a reasonable test. A better approximation may be obtained by replacing the numerator and denominator by multiples of $\chi^2$ with the correct variances as well as the correct means. The significance points of this approximation must be obtained by interpolation and do not have the simple interpretation of the first approximation.
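The two approximations just described reduce to simple arithmetic on the latent roots. The sketch below is illustrative: the function names are assumptions, and the second helper implements the usual two-moment matching of a weighted sum of squares by $c\chi^2_f$, which is one reading of "correct variances as well as the correct means".

```python
def sum_ratio_multiplier(alphas, k):
    """Factor multiplying F_{k,N-k} in the simplest approximation.

    alphas : latent roots alpha_1, ..., alpha_N in ascending order
    k      : number of regression vectors
    """
    return sum(alphas[:k]) / sum(alphas[k:])


def usual_f_multiplier(alphas, k):
    """Same approximation expressed for the mean-square F-ratio."""
    n = len(alphas)
    return (sum(alphas[:k]) / k) / (sum(alphas[k:]) / (n - k))


def chi2_match(weights):
    """Scale c and effective degrees of freedom f such that c*chi2_f has
    the same mean and variance as sum(w_i * eta_i^2)."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s2 / s1, s1 * s1 / s2


# With equal roots nothing changes: the multiplier on the F-ratio is 1.
flat = [1.0] * 10
flat_sum_ratio = sum_ratio_multiplier(flat, 3)   # k/(N-k)
flat_f_mult = usual_f_multiplier(flat, 3)
scale, eff_df = chi2_match([2.0, 2.0])
```

The effective degrees of freedom from `chi2_match` are generally non-integral, which is why the improved significance points must be found by interpolation.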


The author has pleasure in recording his indebtedness to Mr J. Durbin for many helpful 
discussions. 


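APPENDIX

A proof of the inequality (3·2)

THEOREM. Let $a_j>0$, $b_j>0$, $w_j\ge 0$ $(j = 1,\dots,n)$. Then

\[
1 \le \frac{\left(\sum a_j^2w_j\right)\left(\sum b_j^2w_j\right)}{\left(\sum a_jb_jw_j\right)^2} \le \max_{i,j}\frac{(a_ib_j + a_jb_i)^2}{4a_ia_jb_ib_j}.
\]

The extreme values are attained where at most two of $w_1,\dots,w_n$ are non-zero.

This theorem and its proof below are due to J. W. S. Cassels.

Proof. As a straightforward extremal problem it may be shown that

\[
\frac{(1+kw)(1+k^{-1}w)}{(1+w)^2} \le \frac{(1+k)(1+k^{-1})}{4}
\]

if $k>0$, $w\ge 0$. Thus the theorem holds for $n = 2$; for put

\[
w = \frac{a_2b_2w_2}{a_1b_1w_1}, \qquad k = \frac{a_2b_1}{a_1b_2}.
\]

Furthermore, the extremes of the theorem are attained when at most two of $w_1,\dots,w_n$ are non-zero. It may be assumed without loss of generality that

\[
\begin{vmatrix} a_i^2 & a_j^2 & a_k^2 \\ b_i^2 & b_j^2 & b_k^2 \\ a_ib_i & a_jb_j & a_kb_k \end{vmatrix} \ne 0
\]

for all distinct $i, j, k$. For this determinant is given, apart from sign, by the product over the pairs from $i, j, k$ of the determinants

\[
\begin{vmatrix} a_i & a_j \\ b_i & b_j \end{vmatrix},
\]

so that if it vanishes we have say $a_i/b_i = a_j/b_j$. But if this is so the problem is reduced to one in $n-1$ variables.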
















To prove that the extrema of $(\sum a_j^2w_j)(\sum b_j^2w_j)/(\sum a_jb_jw_j)^2$ are attained when at most two of $w_1,\dots,w_n$ are non-zero, let $M$ be an extreme which is attained when say $w_1 \ne 0$, $w_2 \ne 0$, $w_3 \ne 0$. Then we have

\[
\frac{\partial}{\partial w_k}\left\{\left(\sum a_j^2w_j\right)\left(\sum b_j^2w_j\right) - M\left(\sum a_jb_jw_j\right)^2\right\} = 0 \quad (k = 1, 2, 3),
\]

and so

\[
a_k^2Y + b_k^2X - 2Ma_kb_kZ = 0 \quad (k = 1, 2, 3),
\]

where $X = \sum a_j^2w_j$, $Y = \sum b_j^2w_j$, $Z = \sum a_jb_jw_j$. But the determinant of these equations does not vanish. Hence

\[
X = Y = MZ = 0,
\]

which is impossible.

This establishes the theorem. To obtain the inequality (3·2), it is only necessary to notice that the upper bound in the theorem may be put in the form

\[
\max_{i,j}\ \frac14\left(\sqrt{\frac{r_i}{r_j}} + \sqrt{\frac{r_j}{r_i}}\right)^2,
\]

where $r_j = a_j/b_j$. Since the function $w + 1/w$ takes its maximum at the end of the range of variation of $w$, (3·2) follows.


REFERENCES

AITKEN, A. C. (1934-5). On least squares and linear combinations of observations. Proc. Roy. Soc. Edinb. 55, 42-8.

AITKEN, A. C. (1948). On the estimation of many statistical parameters. Proc. Roy. Soc. Edinb. 62, 369.

ANDERSON, R. L. & ANDERSON, T. W. (1950). Distribution of the circular serial correlation coefficient for residuals from a fitted Fourier series. Ann. Math. Statist. 21, 59-81.

BOX, G. E. P. (1954). Some theorems on quadratic forms applied to the study of analysis of variance problems. II. Effects of inequality of variance and of correlation between errors in the two-way classification. Ann. Math. Statist. 25, 484-98.

COCHRANE, D. & ORCUTT, G. H. (1949). Application of least squares regression to relationships containing autocorrelated error terms. J. Amer. Statist. Ass. 44, 32-61.

DURBIN, J. & WATSON, G. S. (1950). Testing for serial correlation in least squares regression. I. Biometrika, 37, 409-28.

DURBIN, J. & WATSON, G. S. (1951). Testing for serial correlation in least squares regression. II. Biometrika, 38, 159-78.

HANNAN, E. J. (1955). Exact tests for serial correlation. Biometrika, 42, 133.

HARDY, G. H., LITTLEWOOD, J. E. & PÓLYA, G. (1934). Inequalities. Cambridge University Press.

PITMAN, E. J. G. & ROBBINS, H. (1949). Application of the method of mixtures to quadratic forms in normal variates. Ann. Math. Statist. 20, 552-60.

WOLD, H. (1949). On least squares regression with autocorrelated variables and residuals. Trans. Int. Inst. Statist. Reprint, pp. 1-13.













SOME THEOREMS AND SUFFICIENCY CONDITIONS FOR THE 
MAXIMUM-LIKELIHOOD ESTIMATOR OF AN UNKNOWN 
PARAMETER IN A SIMPLE MARKOV CHAIN 


By J. GANI 
Australian National University, Canberra, A.C.T. 


I. SUMMARY 


The paper begins with proofs of the usual theorems for the optimum properties of the maximum-likelihood estimator of an unknown parameter $\theta$ which defines the transition probabilities $p_{ij}(\theta)$ of a simple ergodic Markov chain. By an ergodic chain is meant one for which not only is the final chain stationary, but also all possible initial states remain permanently available; these conditions are sufficient to prove that the maximum-likelihood estimator is consistent, and asymptotically normally distributed.

The paper proceeds to establish the form of the transition probabilities $p_{ij}(\theta)$ which admit a sufficient estimator of $\theta$. To do this, the form of the likelihood function admitting a sufficient estimator when the parent distribution is discrete is first derived: this is used to obtain the form of the probabilities $p_i(\theta)$ for a multinomial distribution admitting a sufficient estimator of $\theta$, and the result is finally generalized for the transition probabilities $p_{ij}(\theta)$ of the simple ergodic Markov chain.

The paper closes with an examination of possible forms for the matrix $p$ of transition probabilities $p_{ij}(\theta)$, and these are illustrated with simple examples for Markov chains with two and three states.


II. THE LIKELIHOOD EQUATION FOR A SIMPLE MARKOV CHAIN

Consider the simple Markov chain with $s$ possible states $E_1,\dots,E_s$, and the matrix of transition probabilities $p_{ij}(\theta) = \Pr(E_j \mid E_i)$ $(i, j = 1,\dots,s)$, which are all functions of an unknown parameter $\theta$. The matrix is given by

\[
p = \begin{bmatrix} p_{11}(\theta) & \cdots & p_{1s}(\theta) \\ \vdots & & \vdots \\ p_{s1}(\theta) & \cdots & p_{ss}(\theta) \end{bmatrix},
\]

where the transition probabilities are subject to the condition that

\[
\sum_{j=1}^{s} p_{ij} = 1, \quad\text{for all } i = 1, 2, \dots, s. \tag{1}
\]

If a realization of the chain results in an observed sequence $S$ of $n+1$ states in the following order

\[
E_i,\ E_j,\ \dots,\ E_k,\ E_l,
\]

then it is possible to write the probability of this sequence as

\[
\phi(S) = a_i(\theta)\,p_{ij}(\theta)\cdots p_{kl}(\theta), \tag{2}
\]

where $a_i(\theta)$ $(i = 1,\dots,s)$, the initial probability distribution of the states $E_1,\dots,E_s$, is assumed known.





The likelihood function will be given by

\[
L = \ln\phi(S) = \ln a_i + \ln p_{ij} + \dots + \ln p_{kl}.
\]

Grouping together the transitions from states $E_i$ to states $E_j$ $(i, j = 1,\dots,s)$, and denoting the frequency of these transitions by $n_{ij}$, we write this likelihood function as

\[
L = \ln a_i + \sum_{i=1}^{s}\sum_{j=1}^{s} n_{ij}\ln p_{ij},
\]

or for simplicity

\[
L = \ln a_i + \sum_{i,j} n_{ij}\ln p_{ij}. \tag{3}
\]

It follows that, providing $a_i(\theta)$, $p_{ij}(\theta)$ are differentiable with respect to $\theta$, the derivative of the likelihood is

\[
\frac{dL}{d\theta} = \frac{1}{a_i}\frac{da_i}{d\theta} + \sum_{i,j}\frac{n_{ij}}{p_{ij}}\frac{dp_{ij}}{d\theta}. \tag{4}
\]

As $n$ increases, the second parts of (3) and (4) become dominant, and the likelihood function and its derivative will be

\[
L \sim \sum_{i,j} n_{ij}\ln p_{ij}, \tag{5}
\]

and

\[
\frac{dL}{d\theta} \sim \sum_{i,j}\frac{n_{ij}}{p_{ij}}\frac{dp_{ij}}{d\theta}. \tag{6}
\]

The maximum-likelihood estimator $T$ of $\theta$ will in this case be given by

\[
\left(\frac{dL}{d\theta}\right)_{\theta=T} = \sum_{i,j}\left(\frac{n_{ij}}{p_{ij}}\frac{dp_{ij}}{d\theta}\right)_{\theta=T} = 0. \tag{7}
\]

It is important to note that the $n_{ij}$ are not linearly independent; the number of transitions from the state $E_i$ to the states $E_1,\dots,E_s$ will, except for a possible end-effect, be equal to the number of transitions from the states $E_1,\dots,E_s$ to the state $E_i$, so that

\[
\sum_{j=1}^{s} n_{ij} \doteq \sum_{r=1}^{s} n_{ri}, \tag{8}
\]

where the sign $\doteq$ indicates equality or a possible difference of 1 between the sums. As $n$ increases, we may therefore accept the equations

\[
\sum_{j=1}^{s} n_{ij} = \sum_{r=1}^{s} n_{ri} = n_i \quad (i = 1,\dots,s), \tag{9}
\]

where

\[
\sum_{i=1}^{s} n_i = \sum_{i=1}^{s}\sum_{j=1}^{s} n_{ij} = n.
\]
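The quantities in (5), (7) and (8) are easy to compute from an observed sequence. The sketch below counts the transition frequencies $n_{ij}$, evaluates the dominant part of the log-likelihood, and maximizes it over a grid of $\theta$ values. The two-state model $p(\theta)$ with switching probability $\theta$, the observed sequence and the grid are hypothetical choices made only for illustration.

```python
import math


def transition_counts(states, s):
    """n[i][j] = number of observed transitions from state i to state j."""
    n = [[0] * s for _ in range(s)]
    for i, j in zip(states, states[1:]):
        n[i][j] += 1
    return n


def log_likelihood(n, p):
    """Dominant part of the likelihood, equation (5)."""
    total = 0.0
    for i, row in enumerate(n):
        for j, count in enumerate(row):
            if count:
                total += count * math.log(p[i][j])
    return total


def switching_chain(theta):
    """Hypothetical two-state model: switch with probability theta."""
    return [[1.0 - theta, theta], [theta, 1.0 - theta]]


# Hypothetical realization of n + 1 = 11 states (states labelled 0 and 1).
sequence = [0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1]
counts = transition_counts(sequence, 2)

# End-effect of (8): row sums and column sums differ by at most 1.
row_sums = [sum(row) for row in counts]
col_sums = [sum(counts[r][i] for r in range(2)) for i in range(2)]

# Crude maximization of (5) over a grid; a solution of (7) for this model.
grid = [g / 1000.0 for g in range(1, 1000)]
theta_hat = max(grid, key=lambda th: log_likelihood(counts, switching_chain(th)))
```

For this model the maximizing value is the observed proportion of switches, 3 out of 10 transitions.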


III. OPTIMUM PROPERTIES OF THE MAXIMUM-LIKELIHOOD ESTIMATOR

Following closely the proofs given by Cramér (1946) and Rao (1952) for continuous distributions, we shall deduce that for an ergodic chain, the maximum-likelihood estimator $T$ obtained as a solution of (7) has the following optimum properties:

(a) The estimator $T$ obtained as a solution of (7) is consistent.
(b) The consistent solution of (7) is asymptotically normally distributed about the true value $\theta_0$: it is fully efficient.

A further optimum property due to Huzurbazar (1948) can also be proved; it is that

(c) Any consistent solution of (7) is such that for $n \to \infty$ the probability that the likelihood tends to a maximum converges to 1; the consistent solution is therefore unique.













The proof of this property for the case of the chain considered is, however, identical in its essentials with that given by Huzurbazar for continuous distributions; it will therefore be omitted. In order to establish these properties, we require the assumptions that

\[
\frac{dp_{ij}}{d\theta}, \quad \frac{d^2p_{ij}}{d\theta^2}, \quad \frac{d^3p_{ij}}{d\theta^3}
\]

exist and are continuous for every $\theta$ in a range $R$ including the true value $\theta_0$. Before proceeding to prove these theorems, it may be useful to mention some properties of ergodic chains which we shall refer to.

For an ergodic chain, all possible initial states remain permanently available, and the final chain is stationary and independent of initial conditions. Although it is possible for some transition probabilities $p_{ij}$ of an ergodic chain to be zero, the stationary probabilities $P_i = \Pr(E_i)$ $(i = 1,\dots,s)$, which are given by the matrix equation

\[
p'P = P,
\]

together with the equation $\sum_i P_i = 1$, where $P$ is the column vector of elements $P_i$, will all be non-zero; it is clear that the $P_i$ will also be functions of the parameter $\theta$. In equation (2) no assumption was made about the chain being initially stationary, so that the $a_i$ are not necessarily equal to the $P_i$; however, if the process is initially stationary, we can write $a_i = P_i$ in this and the subsequent equations (3) and (4). In all that follows, we shall for simplicity consider an initially stationary chain; the results obtained will also hold asymptotically for a chain which is not initially stationary.
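The stationary probabilities defined by $p'P = P$ and $\sum_i P_i = 1$ can be obtained numerically by repeated application of the transition matrix, which converges for an ergodic chain. This is a minimal sketch; the function name, iteration count and the two-state example matrix are assumptions for illustration.

```python
def stationary_distribution(p, iters=2000):
    """Iterate P <- p'P from the uniform vector until it settles."""
    s = len(p)
    dist = [1.0 / s] * s
    for _ in range(iters):
        # j-th component of p'P is sum_i P_i p_ij; total mass stays 1.
        dist = [sum(dist[i] * p[i][j] for i in range(s)) for j in range(s)]
    return dist


# Hypothetical two-state ergodic chain.
p_example = [[0.9, 0.1], [0.5, 0.5]]
P_stat = stationary_distribution(p_example)
```

For this matrix the balance condition $0.1\,P_1 = 0.5\,P_2$ gives $P = (5/6,\ 1/6)$, which the iteration reproduces.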

The following set of results for an ergodic chain is connected with the matrix $R$, with a transpose of form

\[
R' = \begin{bmatrix} p_{11}e^{t_{11}} & \cdots & p_{1s}e^{t_{1s}} \\ \vdots & & \vdots \\ p_{s1}e^{t_{s1}} & \cdots & p_{ss}e^{t_{ss}} \end{bmatrix}, \tag{10}
\]

which was introduced by Montroll (1947) and Bartlett (1951), and will later be used to obtain the moment generating function of the transition frequencies $n_{ij}$, previously mentioned in equation (5). It is easy to see that the latent roots $\mu_r(t)$ of this matrix, where $t$ is the matrix of elements $t_{ij}$ $(i, j = 1,\dots,s)$, are given by the equation

\[
|R - \mu I| = 0 \tag{11}
\]

and are continuous in $t$. For $t = 0$, this determinantal equation becomes

\[
|p - \mu I| = 0, \tag{12}
\]

with roots $\mu_1(0),\dots,\mu_s(0)$, not necessarily all distinct; we may, without loss of generality, assume that these are arranged so that their moduli are in descending order of magnitude. For a stationary chain, it is known that

\[
\mu_1(0) = 1, \qquad |\mu_r(0)| < 1 \text{ for the remaining } r = 2, 3, \dots, s;
\]

it follows from the continuity of the roots $\mu_1(t),\dots,\mu_s(t)$ that for $t$ in the neighbourhood of $t = 0$, we have

\[
|\mu_r(t)| < |\mu_1(t)| \quad\text{for all } r = 2, 3, \dots, s.
\]









We prove also that for an ergodic chain, for some $t$ in the neighbourhood of $t = 0$, $\mu_1(t)$ is not identically equal to 1. For, suppose $\mu_1(t) \equiv 1$; then for $t$ such that $t_{11} \ne 0$, and $t_{ij} = 0$ for all other values of $i, j$, the equation (11) would give

\[
\begin{vmatrix} p_{11}e^{t_{11}} - 1 & p_{12} & \cdots & p_{1s} \\ p_{21} & p_{22} - 1 & \cdots & p_{2s} \\ \vdots & \vdots & & \vdots \\ p_{s1} & p_{s2} & \cdots & p_{ss} - 1 \end{vmatrix} = 0. \tag{13}
\]

On expansion, this could be written

\[
(p_{11}e^{t_{11}} - 1)C_{11} + p_{12}C_{12} + \dots + p_{1s}C_{1s} = 0,
\]

where the $C_{1j}$ are cofactors of the elements in the first row and $j$th column. For $t_{11}$ small, this is

\[
p_{11}t_{11}C_{11} + (p_{11} - 1)C_{11} + p_{12}C_{12} + \dots + p_{1s}C_{1s} = 0.
\]

Now if $t_{11} = 0$ also, so that $t = 0$, equation (13) would give

\[
(p_{11} - 1)C_{11} + p_{12}C_{12} + \dots + p_{1s}C_{1s} = 0;
\]

we see, therefore, that $\mu_1(t) \equiv 1$ only if $p_{11}C_{11}t_{11} = 0$, so that $p_{11} = 0$ or $C_{11} = 0$, or both are zero. $C_{11}$ cannot be zero, for since $\mu_1(0) = 1$ is a simple root, then

\[
\left[\frac{d}{d\mu}\,|p - \mu I|\right]_{\mu=1} \ne 0,
\]

or, on expansion,

\[
-\{C_{11} + C_{22} + \dots + C_{ss}\} \ne 0,
\]

where the $C_{ii}$ are cofactors of the elements in the leading diagonal of (13) when $t_{11} = 0$. At least one of these $C_{ii}$ is non-zero, and we may without loss of generality assume that $C_{11}$ is such a non-zero cofactor. If, in addition, $p_{11}$ is non-zero, $\mu_1(t) \ne 1$ for at least the case when $t_{11} \ne 0$ and $t_{ij} = 0$ for all other $i, j$.

It is possible, however, that $p_{11}$ be zero; if this is so, then for an ergodic chain, at least two of the probabilities $p_{12},\dots,p_{1s}$ will be non-zero.* From equation (12) we see that in this case

\[
-C_{11} + p_{12}C_{12} + \dots + p_{1s}C_{1s} = 0,
\]

so that it is clear that at least one of the cofactors $C_{1r}$ $(r = 2,\dots,s)$ which multiplies a non-zero probability $p_{1r}$ is non-zero. Let $t$ now be such that $t_{1r} \ne 0$, and the remaining $t_{ij}$ are all zero; then in exactly the same way as before, for a small value of $t_{1r}$, it can be shown that $\mu_1(t) \equiv 1$ only if $p_{1r}C_{1r}t_{1r} = 0$, a condition which is absurd. So that for $p_{11} = 0$, there still exists a $t$ for which $\mu_1(t) \ne 1$. It follows therefore that in general for an ergodic chain, for some $t$ in the neighbourhood of $t = 0$, $\mu_1(t)$ is not identically equal to 1.


(1) Consistency of the maximum-likelihood estimator 


If the estimator $T$ obtained as a solution of (7) is to be consistent, it is necessary that if $\theta_0$ is the real value of $\theta$, then as $n \to \infty$,

$$\Pr\{|T - \theta_0| < \delta\} \to 1, \tag{14}$$

where $\delta$ is any small positive number.

* By an ergodic chain is usually understood one for which $0 \le p_{ij} \le 1$, but no $p_{ii} = 1$ $(i, j = 1, \ldots, s)$. It is possible, however, that a particular $p_{ij} = 1$ for $i \neq j$; in this case the states $E_i$, $E_j$ may be contracted, and we may therefore accept that for an ergodic chain, $0 \le p_{ij} < 1$ for all $i, j = 1, \ldots, s$. Our statement follows immediately.








346 Maximum-likelihood estimator of an unknown parameter in a Markov chain


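Consider the expansion

$$\frac{1}{n}\frac{dL}{d\theta} = \frac{1}{n}\left(\frac{dL}{d\theta}\right)_{\theta_0} + (\theta - \theta_0)\,\frac{1}{n}\left(\frac{d^2L}{d\theta^2}\right)_{\theta_0} + \tfrac{1}{2}(\theta - \theta_0)^2\,\frac{1}{n}\left(\frac{d^3L}{d\theta^3}\right)_{\theta_1}, \tag{15}$$

where $\theta_1$ lies in $(\theta_0, \theta)$. For simplicity, we write

$$\left(\frac{dL}{d\theta}\right)_{\theta_0}, \quad \left(\frac{d^2L}{d\theta^2}\right)_{\theta_0}, \quad \left(\frac{d^3L}{d\theta^3}\right)_{\theta_1} \quad \text{as} \quad nB_0,\; nB_1,\; nB_2$$

respectively, where these are obtained from (6) as

$$\begin{aligned}
nB_0 &= \sum_{i,j} n_{ij}\left[\frac{1}{p_{ij}}\frac{dp_{ij}}{d\theta}\right]_{\theta_0}, \\
nB_1 &= \sum_{i,j} n_{ij}\left[\frac{1}{p_{ij}}\frac{d^2p_{ij}}{d\theta^2} - \frac{1}{p_{ij}^2}\left(\frac{dp_{ij}}{d\theta}\right)^2\right]_{\theta_0}, \\
nB_2 &= \sum_{i,j} n_{ij}\left[\frac{1}{p_{ij}}\frac{d^3p_{ij}}{d\theta^3} - \frac{3}{p_{ij}^2}\frac{dp_{ij}}{d\theta}\frac{d^2p_{ij}}{d\theta^2} + \frac{2}{p_{ij}^3}\left(\frac{dp_{ij}}{d\theta}\right)^3\right]_{\theta_1}.
\end{aligned} \tag{16}$$

It is simple to see that as $n \to \infty$, the expectations of the $n_{ij}$, irrespective of whether the chain is initially or only finally stationary, will be given for all $i, j = 1, 2, \ldots, s$, by

$$E(n_{ij}) = nP_i\,p_{ij}.$$

Further, since on differentiation with respect to $\theta$, equation (1) gives

$$0 = \sum_{j=1}^{s}\frac{dp_{ij}}{d\theta} = \sum_{j=1}^{s}\frac{d^2p_{ij}}{d\theta^2} = \sum_{j=1}^{s}\frac{d^3p_{ij}}{d\theta^3},$$

we obtain for the expectations of $nB_0$, $nB_1$, $nB_2$, the results

$$\begin{aligned}
nE(B_0) &= n\sum_{i,j} P_i\left[\frac{dp_{ij}}{d\theta}\right]_{\theta_0} = 0, \\
nE(B_1) &= -n\sum_{i,j} P_i\left[\frac{1}{p_{ij}}\left(\frac{dp_{ij}}{d\theta}\right)^2\right]_{\theta_0} = -i(\theta_0)\,n, \\
nE(B_2) &= n\sum_{i,j} P_i\left[-\frac{3}{p_{ij}}\frac{dp_{ij}}{d\theta}\frac{d^2p_{ij}}{d\theta^2} + \frac{2}{p_{ij}^2}\left(\frac{dp_{ij}}{d\theta}\right)^3\right]_{\theta_1} = nK(\theta_1),
\end{aligned} \tag{17}$$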

where $i(\theta)$ is obviously finite and positive. We now prove that the variates $n_{ij}n^{-1}$ converge in probability to their expectations; from this it will follow that although the $n_{ij}n^{-1}$ are not independent, $B_0$, $B_1$, $B_2$, which are linear functions of these variates, also jointly converge in probability to their expectations.
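The quantity $i(\theta) = \sum_{i,j} P_i\,p_{ij}^{-1}(dp_{ij}/d\theta)^2$ that appears in the expectation of $B_1$ can be evaluated numerically for any parametrized chain. The following is a minimal sketch, not part of the original paper: the function names, the central-difference derivative and the two-state symmetric chain used as input are all assumptions chosen for illustration. For that chain the stationary probabilities are $\tfrac12, \tfrac12$ and $i(\theta)$ reduces to $1/\{\theta(1-\theta)\}$.

```python
def stationary(p, iters=2000):
    """Stationary row vector P with P = P p, found by repeated multiplication.
    (Assumes the chain is ergodic and aperiodic so that the iteration settles.)"""
    s = len(p)
    P = [1.0 / s] * s
    for _ in range(iters):
        P = [sum(P[i] * p[i][j] for i in range(s)) for j in range(s)]
    return P


def information(p_of_theta, theta, h=1e-6):
    """i(theta) = sum_ij P_i (dp_ij/dtheta)^2 / p_ij, derivative by central difference."""
    p = p_of_theta(theta)
    hi, lo = p_of_theta(theta + h), p_of_theta(theta - h)
    P = stationary(p)
    total = 0.0
    for i in range(len(p)):
        for j in range(len(p)):
            if p[i][j] > 0:  # zero transition probabilities contribute nothing
                d = (hi[i][j] - lo[i][j]) / (2 * h)
                total += P[i] * d * d / p[i][j]
    return total


def symmetric_chain(theta):
    """Hypothetical two-state chain used only as an illustration."""
    return [[theta, 1 - theta], [1 - theta, theta]]


theta0 = 0.3
i_theta = information(symmetric_chain, theta0)
closed_form = 1 / (theta0 * (1 - theta0))
```

Under these assumptions `i_theta` agrees with `closed_form` up to the error of the numerical derivative.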

In order to prove the convergence in probability of the variates $n_{ij}n^{-1}$ to their expectations $P_i\,p_{ij}$, we shall evaluate their variances using a method described in Fréchet (1952, p. 73). A transition frequency $n_{ij}$ is treated as the sum of $n$ variates $X_{ij}^{(r)}$, so that

$$n_{ij} = \sum_{r=1}^{n} X_{ij}^{(r)},$$

where $X_{ij}^{(r)}$ takes the values 1 or 0 at the $r$th transition, depending on whether this transition is or is not from the state $E_i$ to the state $E_j$. It is clear that

$$\begin{aligned}
E(X_{ij}^{(r)}) &= P_i\,p_{ij}, \\
V(X_{ij}^{(r)}) &= E\{(X_{ij}^{(r)})^2\} - \{E(X_{ij}^{(r)})\}^2 = P_i\,p_{ij} - P_i^2p_{ij}^2 \qquad (r = 1, 2, \ldots, n),
\end{aligned}$$

and

$$\begin{aligned}
C(X_{ij}^{(r)}X_{ij}^{(t)}) &= E(X_{ij}^{(r)}X_{ij}^{(t)}) - E(X_{ij}^{(r)})\,E(X_{ij}^{(t)}) \\
&= P_i\,p_{ij}\,p_{ji}^{(t-r-1)}\,p_{ij} - P_i^2p_{ij}^2 \qquad (r, t = 1, 2, \ldots, n;\ t > r),
\end{aligned}$$

where $p_{ji}^{(t-r-1)}$ is the probability that a transition from the state $E_j$ to the state $E_i$ occur in $t - r - 1$ steps, and is the $(j, i)$th element of the matrix $p^{\,t-r-1}$, the $(t - r - 1)$th power of the matrix $p$.

The variance $\sigma_{ij}^2$ of $n_{ij}$ is therefore given by

$$\begin{aligned}
\sigma_{ij}^2 &= V\Big(\sum_{r=1}^{n} X_{ij}^{(r)}\Big) \\
&= \sum_{r=1}^{n} V(X_{ij}^{(r)}) + 2\sum_{t>r} C(X_{ij}^{(r)}X_{ij}^{(t)}) \\
&= n\,(P_i\,p_{ij} - P_i^2p_{ij}^2) + 2P_i\,p_{ij}^2\sum_{t>r}\big(p_{ji}^{(t-r-1)} - P_i\big),
\end{aligned}$$

where there are $\tfrac12 n(n - 1)$ terms of the form $(p_{ji}^{(t-r-1)} - P_i)$. Now, since the chain is ergodic, and therefore stationary, it is known that for all $t - r - 1 > n_0$, where $n_0$ is some finite positive integer, the terms

$$\big|\,p_{ji}^{(t-r-1)} - P_i\,\big|$$

converge to zero at least as fast as the terms of a convergent geometric progression with a ratio $q$ where $0 < q < 1$ depends on $n_0$. If, therefore, we write the variance $\sigma_{ij}^2$ as

$$\sigma_{ij}^2 = n\,(P_i\,p_{ij} - P_i^2p_{ij}^2) + 2P_i\,p_{ij}^2\sum_{k=0}^{n-2}(n - k - 1)\big(p_{ji}^{(k)} - P_i\big),$$

we see that $\sigma_{ij}^2n^{-1}$ will clearly converge to

$$\lim_{n\to\infty}\sigma_{ij}^2n^{-1} = P_i\,p_{ij} - P_i^2p_{ij}^2 + 2P_i\,p_{ij}^2\,s_{ji} = A,$$

where $A$ is some value independent of $n$, and $s_{ji} = \sum_{k=0}^{\infty}\big(p_{ji}^{(k)} - P_i\big)$.

We now prove that for ergodic chains, the limit $A$ must be non-zero. To do this, we use a theorem given in Fréchet (1952, pp. 86-8) which applies to the frequencies $n_i$ of (9), giving the number of times in a realization that the system is in state $E_i$. The variances $\sigma_i^2 = V(n_i)$ of these can be shown, by a method similar to that used for the $n_{ij}$ above, to be such that $\lim_{n\to\infty}\sigma_i^2n^{-1}$ is some finite value independent of $n$. The theorem proves that for an ergodic chain of the type we consider, $\lim_{n\to\infty}\sigma_i^2n^{-1}$ cannot be zero.
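The limiting value $A = P_i\,p_{ij} - P_i^2p_{ij}^2 + 2P_i\,p_{ij}^2\,s_{ji}$ can be checked numerically by truncating the series for $s_{ji}$. The sketch below is an illustration only: the helper names, the truncation at 200 terms and the two-state symmetric chain with parameter $\theta$ are assumptions, not part of the paper. For $\theta = \tfrac12$ the chain is a sequence of independent fair trials, and a direct calculation for overlapping pairs gives $A = 5/16$.

```python
def mat_mul(a, b):
    """Plain matrix product for small square matrices given as nested lists."""
    s = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(s)) for j in range(s)] for i in range(s)]


def limiting_variance(p, P, i, j, terms=200):
    """A = P_i p_ij - P_i^2 p_ij^2 + 2 P_i p_ij^2 s_ji, with
    s_ji = sum_{k>=0} (p_ji^(k) - P_i) truncated after `terms` terms."""
    s = len(p)
    power = [[1.0 if r == c else 0.0 for c in range(s)] for r in range(s)]  # p^0 = I
    s_ji = 0.0
    for _ in range(terms):
        s_ji += power[j][i] - P[i]
        power = mat_mul(power, p)
    pij = p[i][j]
    return P[i] * pij - (P[i] * pij) ** 2 + 2 * P[i] * pij ** 2 * s_ji


def symmetric(theta):
    """Hypothetical two-state chain; its stationary vector is (1/2, 1/2)."""
    return [[theta, 1 - theta], [1 - theta, theta]]


A_half = limiting_variance(symmetric(0.5), [0.5, 0.5], 0, 0)
A_08 = limiting_variance(symmetric(0.8), [0.5, 0.5], 0, 0)
```

For this chain the series sums in closed form to $s_{11} = 1/\{4(1-\theta)\}$, which gives a second check on the truncated sum.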


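In order to apply this result to our transition frequencies $n_{ij}$ we redefine our system of states in the following fashion, assuming first that the $p_{ij}$ are all non-zero. We define a system of $s^2$ states $E_{ij}$ $(i, j = 1, 2, \ldots, s)$, in which the system will be in state $E_{ij}$ when there is a transition from state $E_i$ to state $E_j$ of the original system; with the states taken in the order $E_{11}, E_{12}, \ldots, E_{1s}, E_{21}, \ldots, E_{ss}$, the new stochastic matrix for this system will be

$$\gamma = \begin{pmatrix}
p_{11} \cdots p_{1s} & 0 \cdots 0 & \cdots & 0 \cdots 0 \\
0 \cdots 0 & p_{21} \cdots p_{2s} & \cdots & 0 \cdots 0 \\
\vdots & & & \vdots \\
0 \cdots 0 & 0 \cdots 0 & \cdots & p_{s1} \cdots p_{ss} \\
\vdots & & & \vdots \\
p_{11} \cdots p_{1s} & 0 \cdots 0 & \cdots & 0 \cdots 0 \\
\vdots & & & \vdots \\
0 \cdots 0 & 0 \cdots 0 & \cdots & p_{s1} \cdots p_{ss}
\end{pmatrix}.$$

If the original stochastic matrix is written as

$$p = (\alpha_1\ \alpha_2\ \ldots\ \alpha_s) = \begin{pmatrix}\beta_1 \\ \vdots \\ \beta_s\end{pmatrix},$$

where $\alpha_i$, $\beta_i$ are respectively column and row vectors, then $\gamma$ can be written

$$\gamma = \begin{pmatrix}
\beta_1 & 0 & \cdots & 0 \\
0 & \beta_2 & \cdots & 0 \\
\vdots & & & \vdots \\
0 & 0 & \cdots & \beta_s \\
\vdots & & & \vdots \\
\beta_1 & 0 & \cdots & 0 \\
\vdots & & & \vdots \\
0 & 0 & \cdots & \beta_s
\end{pmatrix},$$

the block of $s$ rows with $\beta_1, \ldots, \beta_s$ on its diagonal being repeated $s$ times.

It is clear that $n_{ij}$ will now indicate the number of times in a realization of the chain that the system is in the state $E_{ij}$; in other words, the $n_{ij}$ in the new system are the analogues of the $n_i$ in the original system. All that remains to be proved is that if the original system is ergodic, so also will the new system; this is intuitively obvious, and can be easily shown by powering the matrix $\gamma$ as follows. If we write for the $n$th power of the matrix $p$

$$p^n = (\alpha_1^{(n)}\ \alpha_2^{(n)}\ \ldots\ \alpha_s^{(n)}),$$

it is seen directly that on multiplying the matrix $\gamma$ by itself, we obtain

$$\gamma^2 = \left(p_{11}\begin{pmatrix}\alpha_1 \\ \vdots \\ \alpha_1\end{pmatrix}\ \cdots\ p_{1s}\begin{pmatrix}\alpha_1 \\ \vdots \\ \alpha_1\end{pmatrix}\ \ p_{21}\begin{pmatrix}\alpha_2 \\ \vdots \\ \alpha_2\end{pmatrix}\ \cdots\ p_{2s}\begin{pmatrix}\alpha_2 \\ \vdots \\ \alpha_2\end{pmatrix}\ \cdots\ p_{s1}\begin{pmatrix}\alpha_s \\ \vdots \\ \alpha_s\end{pmatrix}\ \cdots\ p_{ss}\begin{pmatrix}\alpha_s \\ \vdots \\ \alpha_s\end{pmatrix}\right),$$

and similarly, since for $n \to \infty$,

$$\lim_{n\to\infty}\alpha_i^{(n)} = \begin{pmatrix}P_i \\ \vdots \\ P_i\end{pmatrix},$$

then

$$\gamma^{n+1} = \left(p_{11}\begin{pmatrix}\alpha_1^{(n)} \\ \vdots \\ \alpha_1^{(n)}\end{pmatrix}\ \cdots\ p_{ss}\begin{pmatrix}\alpha_s^{(n)} \\ \vdots \\ \alpha_s^{(n)}\end{pmatrix}\right) \to \begin{pmatrix} p_{11}P_1 & \cdots & p_{ss}P_s \\ \vdots & & \vdots \\ p_{11}P_1 & \cdots & p_{ss}P_s \end{pmatrix}.$$

It follows therefore by Fréchet's theorem that $\lim_{n\to\infty}\sigma_{ij}^2n^{-1} \neq 0$. If a certain $p_{ij}$ is zero, then in our new system the state $E_{ij}$ must be eliminated, since no transition into or from this state is possible. It can be verified, however, that the results above will hold equally well in such a case. We now conclude from Tchebyshev's theorem that the variates $n_{ij}n^{-1}$ converge in probability to their expectations since

$$\Pr\{|\,n_{ij}n^{-1} - P_i\,p_{ij}\,| > \epsilon\} < \sigma_{ij}^2/n^2\epsilon^2 < A/n\epsilon^2.$$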
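The construction of the enlarged chain on the $s^2$ transition states $E_{ij}$ can be sketched directly: a step from $E_{ij}$ to $E_{jl}$ has probability $p_{jl}$, and every other entry is zero. The code below is an illustrative sketch under assumptions (the function names, the state ordering $E_{11}, E_{12}, \ldots$ and the numerical matrix `p` are chosen here, not taken from the paper). It raises the enlarged matrix to a high power and compares each row with the limiting row $(P_k\,p_{kl})$.

```python
def mat_mul(a, b):
    """Plain matrix product for small square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_pow(a, m):
    """a raised to the m-th power by repeated multiplication."""
    n = len(a)
    out = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(m):
        out = mat_mul(out, a)
    return out


def pair_chain(p):
    """Stochastic matrix on states (i, j): (i, j) -> (j, l) with probability p[j][l]."""
    s = len(p)
    g = [[0.0] * (s * s) for _ in range(s * s)]
    for i in range(s):
        for j in range(s):
            for l in range(s):
                g[i * s + j][j * s + l] = p[j][l]
    return g


p = [[0.9, 0.1], [0.4, 0.6]]   # hypothetical ergodic two-state chain
s = len(p)
P = mat_pow(p, 200)[0]         # a row of a high power approximates the stationary vector
limit_row = [P[k] * p[k][l] for k in range(s) for l in range(s)]
g_high = mat_pow(pair_chain(p), 200)
worst = max(abs(g_high[r][c] - limit_row[c]) for r in range(s * s) for c in range(s * s))
```

A small value of `worst` indicates that all rows of the powered matrix have settled to the same limiting row, which is the ergodicity property needed for the enlarged system.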

We may express the joint convergence in probability of $B_0$, $B_1$, $B_2$ to their expectations in the form

$$\Pr\{|B_0| < \delta^2,\ B_1 < -\tfrac12 i(\theta_0),\ |B_2| < 2|K(\theta_1)|\} > 1 - \epsilon,$$

where $\delta$, $\epsilon$ are any two small positive numbers. Let $F$ denote the set of points for which this inequality holds; then for every point in $F$, we have that

$$|B_0 + \tfrac12 B_2\delta^2| < (1 + |K(\theta_1)|)\,\delta^2$$

and

$$B_1\delta < -\tfrac12 i(\theta_0)\,\delta.$$

If $\theta = \theta_0 \pm \delta$ are two such points in $F$, then from (15) we have for these points

$$\frac{1}{n}\frac{dL}{d\theta} = B_0 \pm \delta B_1 + \tfrac12\delta^2B_2.$$

Provided that $\delta$ is so defined that

$$\delta < \tfrac12 i(\theta_0)\,(1 + |K(\theta_1)|)^{-1},$$

then for $\theta = \theta_0 + \delta$,

$$\frac{1}{n}\frac{dL}{d\theta} < (1 + |K(\theta_1)|)\,\{\delta - \tfrac12 i(\theta_0)\,(1 + |K(\theta_1)|)^{-1}\}\,\delta < 0,$$

whereas for $\theta = \theta_0 - \delta$,

$$\frac{1}{n}\frac{dL}{d\theta} > (1 + |K(\theta_1)|)\,\{-\delta + \tfrac12 i(\theta_0)\,(1 + |K(\theta_1)|)^{-1}\}\,\delta > 0.$$

We see from (6) that $dL/d\theta$ is a continuous function of $\theta$ in $R$, and it follows therefore that the likelihood equation (7) has a root in $\theta_0 \pm \delta$ with probability tending to 1. This establishes the consistency of the estimator $T$ of $\theta$.
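Consistency can be illustrated on a case where the likelihood equation has an explicit root. The sketch below is an assumption-laden example, not the paper's own: it uses a two-state chain with $p_{11} = p_{22} = \theta$ and $p_{12} = p_{21} = 1 - \theta$, for which $L = (n_{11} + n_{22})\ln\theta + (n_{12} + n_{21})\ln(1 - \theta)$ and the root of $dL/d\theta = 0$ is $T = (n_{11} + n_{22})/n$. The function names, the seed and the sample sizes are arbitrary choices.

```python
import random


def simulate_counts(theta, n, seed=0):
    """Transition frequencies n_ij from n transitions of the symmetric two-state chain,
    started in state 0 (the initial state is taken as fixed)."""
    rng = random.Random(seed)
    state = 0
    counts = [[0, 0], [0, 0]]
    for _ in range(n):
        nxt = state if rng.random() < theta else 1 - state
        counts[state][nxt] += 1
        state = nxt
    return counts


def mle(counts):
    """Root of the likelihood equation for this chain: proportion of 'stay' transitions."""
    stay = counts[0][0] + counts[1][1]
    total = stay + counts[0][1] + counts[1][0]
    return stay / total


theta0 = 0.3
estimates = {n: mle(simulate_counts(theta0, n)) for n in (100, 2000, 20000)}
```

With these settings the estimates should lie close to `theta0` for the larger sample sizes; the individual values depend on the seed and are not quoted here.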


(2) Asymptotic normality of maximum-likelihood estimator 


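Let the consistent solution of the likelihood equation be $T$; then from equation (15), we have that

$$\frac{1}{n}\left(\frac{dL}{d\theta}\right)_{T} = \frac{1}{n}\left(\frac{dL}{d\theta}\right)_{\theta_0} + (T - \theta_0)\,\frac{1}{n}\left(\frac{d^2L}{d\theta^2}\right)_{\theta_0} + \tfrac12(T - \theta_0)^2\,\frac{1}{n}\left(\frac{d^3L}{d\theta^3}\right)_{\theta_1} = 0,$$

or

$$B_0 + (T - \theta_0)\,B_1 + \tfrac12(T - \theta_0)^2B_2 = 0,$$

so that

$$(T - \theta_0) = -B_0\,\{B_1 + \tfrac12(T - \theta_0)\,B_2\}^{-1}.$$

We have seen that the expectation of $B_0$ is zero; its variance can be evaluated, again using the method described in Fréchet (1952, p. 73):

$$V(B_0) = E\left[\frac{1}{n}\sum_{i,j} n_{ij}\left(\frac{1}{p_{ij}}\frac{dp_{ij}}{d\theta}\right)_{\theta_0}\right]^2 = \frac{1}{n^2}\,E\left[\sum_{i,j}\frac{n_{ij}}{p_{ij}}\frac{dp_{ij}}{d\theta}\right]^2_{\theta_0}. \tag{18}$$

We treat each $n_{ij}$ as the sum of the $n$ variates $X_{ij}^{(r)}$ which take the values 1 or 0 at each of the $n$ transitions $r = 1, \ldots, n$, depending on whether these transitions are or are not from the state $E_i$ to the state $E_j$. We then have

$$\begin{aligned}
E\left[\sum_{i,j}\frac{n_{ij}}{p_{ij}}\frac{dp_{ij}}{d\theta}\right]^2
&= E\left[\sum_{r}\sum_{i,j}\frac{X_{ij}^{(r)}}{p_{ij}}\frac{dp_{ij}}{d\theta}\right]^2 \\
&= E\left[\sum_{r}\left(\sum_{i,j}\frac{X_{ij}^{(r)}}{p_{ij}}\frac{dp_{ij}}{d\theta}\right)^2\right] + 2E\left[\sum_{r<t}\left(\sum_{i,j}\frac{X_{ij}^{(r)}}{p_{ij}}\frac{dp_{ij}}{d\theta}\right)\left(\sum_{l,m}\frac{X_{lm}^{(t)}}{p_{lm}}\frac{dp_{lm}}{d\theta}\right)\right] \\
&= E\left[\sum_{r}\sum_{i,j}\frac{(X_{ij}^{(r)})^2}{p_{ij}^2}\left(\frac{dp_{ij}}{d\theta}\right)^2\right] + E\left[\sum_{r}\sideset{}{'}\sum_{i,j}\sideset{}{'}\sum_{l,m}\frac{X_{ij}^{(r)}X_{lm}^{(r)}}{p_{ij}\,p_{lm}}\frac{dp_{ij}}{d\theta}\frac{dp_{lm}}{d\theta}\right] \\
&\quad + 2E\left[\sum_{r<t}\sum_{i,j}\sum_{l,m}\frac{X_{ij}^{(r)}X_{lm}^{(t)}}{p_{ij}\,p_{lm}}\frac{dp_{ij}}{d\theta}\frac{dp_{lm}}{d\theta}\right],
\end{aligned}$$

where $\sum_r$, $\sum_{i,j}$, $\sum_{l,m}$ indicate summation over all possible values $r = 1, \ldots, n$; $i, j$ and $l, m = 1, \ldots, s$, but where $\sum'_{i,j}\sum'_{l,m}$ indicate summation over those values of $i, j$ and $l, m = 1, \ldots, s$, for which at least one of the conditions $i \neq l$, $j \neq m$ holds. Since for these values of $i, j$ and $l, m$, we have that $X_{ij}^{(r)}X_{lm}^{(r)} = 0$, and further $\sum_m dp_{lm}/d\theta = 0$ for all $l$, then this is equal to

$$\sum_{r}\sum_{i,j}P_i\,p_{ij}\,\frac{1}{p_{ij}^2}\left(\frac{dp_{ij}}{d\theta}\right)^2 + 2\sum_{r<t}\left[\sum_{i,j}\sum_{l}P_i\,p_{ij}\,p_{jl}^{(t-r-1)}\,\frac{1}{p_{ij}}\frac{dp_{ij}}{d\theta}\left(\sum_{m}\frac{dp_{lm}}{d\theta}\right)\right] = n\sum_{i,j}\frac{P_i}{p_{ij}}\left(\frac{dp_{ij}}{d\theta}\right)^2.$$

It follows that

$$V(B_0) = -n^{-1}E(B_1) = n^{-1}i(\theta_0),$$

so that

$$(T - \theta_0)\,\sqrt{\{i(\theta_0)\,n\}} = \frac{B_0\,\sqrt{n}/\sqrt{i(\theta_0)}}{-\dfrac{B_1}{i(\theta_0)} - \dfrac{\tfrac12(T - \theta_0)\,B_2}{i(\theta_0)}}$$

has zero expectation and unit variance, for the denominator converges to 1 in probability, and the numerator has zero expectation and unit variance. If we can further show that $B_0$ is asymptotically normally distributed, then $T$ will be asymptotically normally distributed with expectation $\theta_0$ and variance $\{i(\theta_0)\,n\}^{-1}$.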

Now from (16) we see that $B_0$ is a linear function of the $n_{ij}$, so that it is sufficient to prove that as $n \to \infty$, these are jointly normally distributed. This has been proved, for the ergodic chains considered, by Bartlett (1951), who in his work also implied some of the properties of maximum-likelihood estimators developed in detail in this paper. The proof is briefly outlined below; one difference, however, is that it is no longer necessary to postulate $\ln\mu_1(t) \neq 0$, since it was shown in §III that for the chains considered, this must be so for some $t$.

Following Montroll (1947) and Bartlett (1951), we define the moment generating function of the $n_{ij}$ as

$$M(t) = E\Big(\exp\sum_{i,j}t_{ij}n_{ij}\Big) = E\Big(\exp\sum_{i,j}t_{ij}\sum_{r}X_{ij}^{(r)}\Big).$$

If $R$ is the square matrix whose transpose $R'$ is defined by (10), this is

$$M(t) = 1'R^nP,$$

where $1'$ denotes the row vector $(1, \ldots, 1)$, and $P$ is the column vector of the stationary probabilities $P_i$. If $R$ has the $s$ latent roots $\mu_1(t), \ldots, \mu_s(t)$, we have seen that for small $t$, $\mu_1(t)$ is the single dominant root such that $\mu_1(0) = 1$. Following Frazer, Duncan & Collar (1947, §4·15), we have that Sylvester's theorem gives for $n \to \infty$,

$$R^n \sim \{\mu_1(t)\}^n\,Z_0(\mu_1),$$

where the matrix $Z_0(\mu_1)$ is finite. It follows that

$$M(t) \sim \{\mu_1(t)\}^n\,1'Z_0(\mu_1)\,P,$$

so that

$$\ln M(t) \sim n\ln\mu_1(t),$$

a non-zero value, since for an ergodic chain, for some small $t$ it has been shown that $\mu_1(t) \neq 1$.
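The matrix expression for the moment generating function can be checked against a direct enumeration of all realizations for small $n$. The sketch below rests on an assumption that is not visible in this section: that the matrix $R'$ of (10) has elements $p_{ij}e^{t_{ij}}$, which is what the definition of $M(t)$ implies when the initial state follows the stationary distribution. The function names and numerical values are hypothetical.

```python
import itertools
import math


def mgf_matrix(p, t, P, n):
    """M(t) = 1' R^n P, computed as P' Q^n 1 with Q_ij = p_ij * exp(t_ij) (assumed form of R')."""
    s = len(p)
    q = [[p[i][j] * math.exp(t[i][j]) for j in range(s)] for i in range(s)]
    v = [1.0] * s
    for _ in range(n):
        v = [sum(q[i][j] * v[j] for j in range(s)) for i in range(s)]
    return sum(P[i] * v[i] for i in range(s))


def mgf_brute(p, t, P, n):
    """Same quantity by summing exp(sum t_ij n_ij) over every path of n transitions."""
    s = len(p)
    total = 0.0
    for path in itertools.product(range(s), repeat=n + 1):
        prob, expo = P[path[0]], 0.0
        for a, b in zip(path, path[1:]):
            prob *= p[a][b]
            expo += t[a][b]
        total += prob * math.exp(expo)
    return total


p = [[0.9, 0.1], [0.4, 0.6]]   # hypothetical ergodic chain
P = [0.8, 0.2]                 # its stationary probabilities
t = [[0.1, 0.0], [0.0, 0.0]]   # only t_11 non-zero, as in the argument of Section III

exact = mgf_matrix(p, t, P, 6)
brute = mgf_brute(p, t, P, 6)
# M(t) at n+1 divided by M(t) at n approaches the dominant root mu_1(t) for large n.
growth = mgf_matrix(p, t, P, 201) / mgf_matrix(p, t, P, 200)
```

The ratio `growth` estimates $\mu_1(t)$; for this choice of $t$ it differs from 1, in line with the statement that $\ln M(t) \sim n\ln\mu_1(t)$ is non-zero.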





$M(t)$ is thus asymptotically equivalent to the moment generating function of a sum of $n$ identical independent vector variates. Since for an ergodic chain, the $n_{ij}$ have variances of order $n$ so that $\sigma_{ij}^2n^{-1}$ are finite and non-zero, they will tend to simultaneous normality as $n \to \infty$ by the Central Limit Theorem for $n = \sum_{i,j}n_{ij}$ variates. It follows that $B_0$ will be asymptotically normally distributed.

Finally, a general convergence theorem of Cramér (1946) enables us to conclude that $(T - \theta_0)\sqrt{\{i(\theta_0)\,n\}}$ is asymptotically normally distributed with zero expectation and unit variance; this is that if a variate $\xi_n$ with distribution function $f_n(x)$ tending to $f(x)$ and a variate $\eta_n$ converging in probability to a positive constant $c$ as $n \to \infty$ exist, then the distribution function of $\xi_n\eta_n^{-1}$ tends to $f(cx)$.

Since

$$\{i(\theta_0)\,n\}^{-1} = -\left[E\left\{\left(\frac{d^2L}{d\theta^2}\right)_{\theta_0}\right\}\right]^{-1}$$

is the minimum possible variance, we see clearly that the consistent solution of the maximum-likelihood equation is fully efficient.


IV. SUFFICIENCY CONDITIONS 


A further general theorem of maximum-likelihood estimation, proved under certain conditions by Cramér (1946, § 33·2) for continuous distributions, applies equally with minor modifications to the case of the Markov chain. It is that:

If a sufficient estimator $T$ of the parameter $\theta$ exists, any solution of the likelihood equation will be a function of $T$. In obtaining the maximum-likelihood estimator $T$ of $\theta$, the parameter defining the transition probabilities $p_{ij}(\theta)$ of a Markov chain, it may therefore be of interest to determine the form of the transition probabilities which admit a sufficient statistic for $\theta$.

Koopman (1936) and Pitman (1936) have given the general form of continuous distributions admitting a sufficient statistic, but although it has been obvious that a similar form exists for discrete distributions, no account of the proof for this appears to be available. We briefly outline the proof, use it to obtain the form of the probabilities $p_1(\theta), \ldots, p_k(\theta)$, in the particular case of the multinomial distribution, and finally generalize our results in the case of the simple ergodic Markov chain.


(1) The form of the likelihood function admitting a sufficient estimator T of θ
for discrete parent distributions with constant variate intervals

We consider those discrete distributions $p(x, \theta)$ for which the variate $x$ is defined at equal intervals. We can then, without loss of generality, assume the interval to be unity so that $x$ takes consecutive integral values in any specific range, finite or infinite; this includes the standard discrete distributions for which $x$ is a frequency, as well as others for which $x$ may take positive or negative integral values.

Let $x_1, \ldots, x_n$ be $n$ independent observations of a variate $x$ with the discrete distribution $p(x, \theta)$, where $x$ takes a set of consecutive integral values in a given range. The probability of this sample is

$$\phi(x_1, \ldots, x_n; \theta) = p(x_1, \theta) \cdots p(x_n, \theta).$$

If this is to admit a sufficient estimator $T$ of $\theta$, then the factorizability condition

$$\phi(x_1, \ldots, x_n; \theta) = f(x_1, \ldots, x_n)\,F(T, \theta)$$











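must hold, or for the likelihood function $L = \ln\phi$,

$$L = \ln f(x_1, \ldots, x_n) + \ln F(T, \theta). \tag{19}$$

Assuming that the functions $p(x_i, \theta)$, $F(T, \theta)$ are differentiable with respect to $\theta$, we obtain the equation

$$\sum_{i}\frac{\partial}{\partial\theta}\ln p(x_i, \theta) = \frac{\partial}{\partial\theta}\{\ln F(T, \theta)\} = G(T, \theta). \tag{20}$$

Since this holds for all values of $\theta$ within the range allowed, a particular value $\theta_0$ lying in this range will give, when substituted in this equation,

$$\sum_{i}\left(\frac{\partial}{\partial\theta}\{\ln p(x_i, \theta)\}\right)_{\theta_0} = G(T, \theta_0) = g(T), \tag{21}$$

which connects $T$ with the statistic

$$u = \sum_{i}u(x_i) = \sum_{i}\left(\frac{\partial}{\partial\theta}\{\ln p(x_i, \theta)\}\right)_{\theta_0}.$$

Now since sufficiency is a property which holds for all values of the sample $x_1, \ldots, x_n$, we may allow $x_i$, a particular discrete sample value, to increase by 1, while the remaining $x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n$ remain unchanged. We can then difference equation (21) with respect to $x_i$ to obtain

$$\Delta_{x_i}u = \Delta_{x_i}u(x_i) = \Delta_{x_i}g(T).$$

Similarly for equation (20)

$$\Delta_{x_i}\frac{\partial}{\partial\theta}\ln p(x_i, \theta) = \Delta_{x_i}G(T, \theta),$$

so that for all values of $i$, we have the equation

$$\frac{\Delta_{x_i}\dfrac{\partial}{\partial\theta}\ln p(x_i, \theta)}{\Delta_{x_i}u(x_i)} = \frac{\Delta_{x_i}G(T, \theta)}{\Delta_{x_i}g(T)}. \tag{22}$$

Now if $G(T, \theta)$ and $g(T)$ are assumed differentiable with respect to $T$, and we write $T' = T(x_1, \ldots, x_{i-1}, x_i + 1, x_{i+1}, \ldots, x_n)$ for the value of the estimator after the increase, we have that

$$\begin{aligned}
\Delta_{x_i}G(T, \theta) &= G(T(x_1, \ldots, x_{i-1}, x_i + 1, x_{i+1}, \ldots, x_n), \theta) - G(T(x_1, \ldots, x_i, \ldots, x_n), \theta) \\
&= G(T', \theta) - G(T, \theta) \\
&= (T' - T)\left(\frac{\partial G}{\partial T}\right)_{T_1} \\
&= (\Delta_{x_i}T)\left(\frac{\partial G}{\partial T}\right)_{T_1},
\end{aligned}$$

where $T_1$ is some function $T_1(x_1, \ldots, x_i, \ldots, x_n; \theta)$ of the sample values and of $\theta$ lying between $T$ and $T'$. Further, since the estimator $T$ is a symmetrical function of the $x_i$, the function $T_1$ will be the same in the case of every differencing with respect to each $x_i$, $i = 1, \ldots, n$. Similarly,

$$\Delta_{x_i}g(T) = (\Delta_{x_i}T)\left(\frac{dg}{dT}\right)_{T_2},$$

where $T_2(x_1, \ldots, x_i, \ldots, x_n; \theta_0)$ also lies between $T$ and $T'$. It follows that equation (22) can be written for all values of $i$ as

$$\frac{\Delta_{x_i}\dfrac{\partial}{\partial\theta}\ln p(x_i, \theta)}{\Delta_{x_i}u(x_i)} = \frac{\left(\dfrac{\partial G}{\partial T}\right)_{T_1(x_1, \ldots, x_n;\,\theta)}}{\left(\dfrac{dg}{dT}\right)_{T_2(x_1, \ldots, x_n;\,\theta_0)}},$$

which is equal to a fixed value, not depending on $i$. Hence

$$\frac{\Delta_{x_1}\dfrac{\partial}{\partial\theta}\ln p(x_1, \theta)}{\Delta_{x_1}u(x_1)} = \frac{\Delta_{x_2}\dfrac{\partial}{\partial\theta}\ln p(x_2, \theta)}{\Delta_{x_2}u(x_2)} = \cdots = \frac{\Delta_{x_n}\dfrac{\partial}{\partial\theta}\ln p(x_n, \theta)}{\Delta_{x_n}u(x_n)},$$

and this can only be so if these are all equal to a function of $\theta$ only, $K_1(\theta)$. It follows from (22) that

$$\Delta_{x_i}G(T, \theta) = K_1(\theta)\,\Delta_{x_i}g(T),$$

so that

$$G(T, \theta) = K_1(\theta)\,g(T) + K_2(\theta).$$

On integrating this with respect to $\theta$, we obtain

$$\ln F(T, \theta) = A_1(\theta)\,g(T) + A_2(\theta) + A_3(T),$$

so that the likelihood (19) is of the form

$$L = \ln f(x_1, \ldots, x_n) + A_1(\theta)\,g(T) + A_2(\theta), \tag{23}$$

and the probability of the sample is of the form

$$\phi(x_1, \ldots, x_n; \theta) = f(x_1, \ldots, x_n)\exp[A_1(\theta)\,g(T) + A_2(\theta)].$$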


(2) Form of the probabilities p_i(θ) for a multinomial distribution
admitting a sufficient estimator T of θ

Consider the $k$ mutually exclusive events $E_1, E_2, \ldots, E_k$, with non-zero probabilities $p_1(\theta), \ldots, p_k(\theta)$ such that $\sum_{i=1}^{k}p_i(\theta) = 1$. In a total of $n$ independent trials in which there are $x_1$ occurrences of the event $E_1$, $x_2$ of the event $E_2$, ..., and $x_k$ of the event $E_k$, the probability of the sample of frequencies $x_1, x_2, \ldots, x_k$ is

$$\phi(x_1, \ldots, x_k; \theta) = p_1^{x_1}(\theta) \cdots p_k^{x_k}(\theta).$$

For this to admit a sufficient estimator, the likelihood $L = \ln\phi$ must be of the form (23), so that

$$L = \sum_{i=1}^{k}x_i\ln p_i(\theta) = \ln f(x_1, \ldots, x_k) + A_1(\theta)\,g(T) + A_2(\theta).$$

We note in this case that $A_2(\theta)$ involves the number of trials $n = \sum_{i=1}^{k}x_i$ so that it can be written

$$A_2(\theta) = \Lambda_2(\theta)\sum_{i=1}^{k}x_i,$$

where $\Lambda_2(\theta)$ is a function of $\theta$ only. Also from the form of the likelihood, we see that the first term can only be of the form

$$\ln f(x_1, \ldots, x_k) = \sum_{i=1}^{k}A_ix_i.$$

We can therefore write the likelihood function admitting a sufficient estimator $T$ of $\theta$ as

$$L = \sum_{i}x_i\ln p_i(\theta) = \sum_{i}A_ix_i + A_1(\theta)\,g(T) + \Lambda_2(\theta)\sum_{i}x_i, \tag{24}$$

which holds for all values of $n$, and of the frequency values $x_1, \ldots, x_k$. If, in a total of $n + 1$ independent trials, the frequency of each event $E_1, E_2, \ldots, E_{i-1}, E_{i+1}, \ldots, E_k$ apart from the event $E_i$ remains unchanged, while the event $E_i$ occurs $x_i + 1$ times, then we may difference equation (24) with respect to $x_i$, to obtain

$$\ln p_i(\theta) = A_i + A_1(\theta)\,\Delta_{x_i}g(T) + \Lambda_2(\theta),$$

or

$$\{A_1(\theta)\}^{-1}\{\ln p_i(\theta) - A_i - \Lambda_2(\theta)\} = \Delta_{x_i}g(T).$$

That is, a function of $\theta$ only equals a function of the sample values $x_i$, which is possible only if both equal a constant $K_i$.

It follows that

$$p_i(\theta) = \exp[K_iA_1(\theta) + \Lambda_2(\theta) + A_i] = \alpha_i\exp[K_iA_1(\theta) + \Lambda_2(\theta)] \qquad (i = 1, \ldots, k), \tag{25}$$

where $A_i = \ln\alpha_i$, is the form of the probability allowing a sufficient estimator $T$ of $\theta$. A condition to which the probabilities $p_i(\theta)$ are subject is

$$\sum_{i}p_i(\theta) = 1,$$

so that in our case, from (25)

$$\sum_{i}\alpha_i\exp[K_iA_1(\theta) + \Lambda_2(\theta)] = 1,$$

or

$$\sum_{i}\alpha_i\exp[K_iA_1(\theta)] = \exp[-\Lambda_2(\theta)].$$

An alternative way of writing the probabilities $p_i(\theta)$ is therefore

$$p_i(\theta) = \alpha_i\exp[K_iA_1(\theta)]\Big\{\sum_{i}\alpha_i\exp[K_iA_1(\theta)]\Big\}^{-1} = \alpha_i\lambda_1^{K_i}(\theta)\Big\{\sum_{i}\alpha_i\lambda_1^{K_i}(\theta)\Big\}^{-1}, \tag{26}$$

where $\lambda_1(\theta) = \exp[A_1(\theta)]$.

We can now tell immediately in any practical case whether it is possible to obtain a sufficient estimator or not by an examination of the form of the multinomial probabilities. Three simple illustrations from trinomial distributions are given as examples.

Example IV.2-1. The trinomial distribution with probabilities

$$p_1 = \theta\,(\theta + \theta^2 + 2\theta^3)^{-1}, \quad p_2 = \theta^2(\theta + \theta^2 + 2\theta^3)^{-1}, \quad p_3 = 2\theta^3(\theta + \theta^2 + 2\theta^3)^{-1},$$

of the form (26) with $\alpha_i$ equal to 1, 1, 2 respectively and $K_i$ equal to 1, 2, 3, where $\lambda_1(\theta) = \theta$, will give an estimator of $\theta$ which is sufficient.

Example IV.2-2. Probabilities of the type

$$p_1 = \theta, \quad p_2 = 2\theta, \quad p_3 = 1 - 3\theta,$$

which can be written in the form (25)

$$p_i = \alpha_i\exp[K_i\ln\{\theta\,(1 - 3\theta)^{-1}\} + \ln(1 - 3\theta)],$$

where $\alpha_i$ equal 1, 2, 1, and $K_i$ equal 1, 1, 0, respectively, will also give an estimator of $\theta$ which is sufficient.

Example IV.2-3. However, probabilities of the form

$$p_1 = \tfrac14(2 + \theta), \quad p_2 = \tfrac12(1 - \theta), \quad p_3 = \tfrac14\theta,$$

which occur in genetics, cannot be written in the forms (25) or (26), and can be verified to admit no sufficient estimator of $\theta$.
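The examination of the form of trinomial probabilities can be mimicked numerically. Under the form (25), $\ln p_i(\theta) - \ln p_i(\theta_0) = K_i\,a(\theta) + b(\theta)$ for some functions $a$, $b$, so the pair of contrasts $(u_1 - u_3,\ u_2 - u_3)$ with $u_i = \ln\{p_i(\theta)/p_i(\theta_0)\}$ must point along one fixed direction for every $\theta$. The sketch below tests this collinearity on a few values of $\theta$; it is a numerical screen written for this illustration (function names, test values and tolerance are assumptions), not a procedure given in the paper.

```python
import math


def fits_exponential_form(probs, thetas, theta0, tol=1e-9):
    """True if the log-probability contrasts stay collinear over the given thetas,
    as required by p_i = alpha_i * exp[K_i * A1(theta) + (function of theta)]."""
    base = probs(theta0)

    def contrasts(theta):
        p = probs(theta)
        u = [math.log(p[i] / base[i]) for i in range(3)]
        return u[0] - u[2], u[1] - u[2]

    vecs = [contrasts(th) for th in thetas]
    # 2x2 determinants vanish exactly when all contrast vectors are parallel.
    return all(abs(a[0] * b[1] - a[1] * b[0]) < tol for a in vecs for b in vecs)


def example_1(theta):
    d = theta + theta ** 2 + 2 * theta ** 3
    return [theta / d, theta ** 2 / d, 2 * theta ** 3 / d]


def example_2(theta):
    return [theta, 2 * theta, 1 - 3 * theta]


def example_3(theta):
    return [(2 + theta) / 4, (1 - theta) / 2, theta / 4]


thetas = [0.1, 0.2, 0.3]
results = [fits_exponential_form(f, thetas, 0.15) for f in (example_1, example_2, example_3)]
```

With these inputs the first two examples pass the screen and the genetic example does not, in agreement with the three illustrations above.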










(3) Form of the transition probabilities p_ij(θ) for a simple ergodic
Markov chain admitting a sufficient estimator T of θ

Consider the realization of the simple ergodic Markov chain with $s$ possible states $E_1, \ldots, E_s$ which results in the observed sequence $S$ of the $n + 1$ states $E_i, E_j, \ldots, E_k, E_l$, in this order. Assuming the initial state $E_i$ to be fixed,* the probability distribution of the sequence $S$ is given by (2), and the likelihood can be written as equation (3). An ergodic chain (see footnote, § III) is such that for any row $i$ of the transition matrix, and for all values of $j$,

$$0 \le p_{ij}(\theta) < 1,$$

so that, in particular, for $i = j$ no state $E_i$ is an absorption state; some, though not all, $p_{ij}(\theta)$ in a row may equal zero, subject to the usual conditions (1) that

$$\sum_{j=1}^{s}p_{ij}(\theta) = 1 \qquad (i = 1, 2, \ldots, s).$$

Now if the chain admits a sufficient estimator $T$ of $\theta$, we see that the likelihood may be written, much in the same way as for the multinomial case,

$$L = \sum_{i,j}n_{ij}\ln p_{ij}(\theta) = \sum_{i,j}A_{ij}n_{ij} + A_1(\theta)\,g(T) + \Lambda_2(\theta)\sum_{i,j}n_{ij}, \tag{27}$$

where the $A_{ij}$ are constants, $A_2(\theta)$ is equal to the function $\Lambda_2(\theta)\sum_{i,j}n_{ij}$ in this case, and $\sum_{i,j}n_{ij} = n$.

Before proceeding, as in the similar case of the multinomial, to increase the number of trials in the realization of the chain by one, so as to allow this equation to be differenced with respect to the $n_{ij}$ in order to obtain the non-zero values of the $p_{ij}(\theta)$, we must consider some difficulties which this method presents. Since to any zero transition probability $p_{ij}(\theta)$ there will necessarily correspond a zero value of the transition frequency $n_{ij}$, neither will appear in the likelihood function (27). We are concerned with the $n_{ij}$ corresponding to non-zero values of $p_{ij}(\theta)$, and differencing occurs only with respect to these; however, owing to the set of relations (8) for the sums of transition frequencies, it is not always possible in increasing the number of trials in a realization of the chain by 1, from $n + 1$ to $n + 2$, to increase each of the $n_{ij}$ associated with a non-zero transition probability $p_{ij}(\theta)$ singly by 1, while leaving the values of the remaining $n_{ij}$ unchanged.

* We are specifically considering the case of sufficiency for a distribution conditional on the given initial state $E_i$. In the general case where any state $E_i$ may occur initially, the likelihood is

$$L = \sum_{i=1}^{s}x_i\ln a_i + \sum_{i,j}n_{ij}\ln p_{ij},$$

where the variates $x_i$ are 1 or 0 according to whether the state $E_i$ is or is not the initial state. In addition to the usual conditions for the $n_{ij}$, the $x_i$ are subject to the condition that

$$\sum_{i=1}^{s}x_i = 1.$$

This more difficult case will not be considered.

As a simple example to illustrate this point, we consider the case of a chain with two states $E_1$ and $E_2$, for which a possible realization of three trials, starting with the fixed state $E_1$, results in the sequence $E_1, E_1, E_2$. Assuming all the transition probabilities to have non-zero values, we see immediately that the transition frequency matrix with elements $n_{ij}$ is

$$n = \begin{pmatrix} n_{11} & n_{12} \\ n_{21} & n_{22} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix};$$

a realization of four trials starting with $E_1$ will give, among others, the following transition frequency matrices:

$$\begin{pmatrix} 2 & 1 \\ 0 & 0 \end{pmatrix}, \quad \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}, \quad \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix},$$

for the sequences $E_1, E_1, E_1, E_2$; $E_1, E_1, E_2, E_1$; and $E_1, E_1, E_2, E_2$. Here we see that $n_{11}$, $n_{21}$, $n_{22}$ have singly been increased by 1, while the remaining $n_{ij}$ remain constant; it is, however, in no way possible, starting with the given matrix $n$, to increase $n_{12}$ by 1, while leaving $n_{11}$, $n_{21}$, $n_{22}$ unchanged.
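The two-state illustration can be confirmed by enumerating every realization of four trials that starts in $E_1$ and recording which single transition frequency can exceed that of the three-trial sequence $E_1, E_1, E_2$ by exactly one while the others stay fixed. This is a small sketch written for the example (states are coded 0 and 1, and the function name is hypothetical).

```python
from itertools import product


def transition_counts(seq, s=2):
    """Matrix of transition frequencies n_ij for a sequence of state indices."""
    n = [[0] * s for _ in range(s)]
    for a, b in zip(seq, seq[1:]):
        n[a][b] += 1
    return n


base = transition_counts((0, 0, 1))  # the sequence E1, E1, E2

reachable = set()
for seq in product(range(2), repeat=4):
    if seq[0] != 0:
        continue  # the initial state E1 is fixed
    n = transition_counts(seq)
    diff = [(i, j) for i in range(2) for j in range(2) if n[i][j] != base[i][j]]
    if len(diff) == 1:
        i, j = diff[0]
        if n[i][j] == base[i][j] + 1:
            reachable.add((i, j))

# reachable == {(0, 0), (1, 0), (1, 1)}: n_11, n_21 and n_22 can each be raised singly,
# but (0, 1), i.e. n_12, cannot.
```

The pair `(0, 1)` is absent because a second transition $E_1 \to E_2$ requires an intervening return $E_2 \to E_1$, which would change $n_{21}$ as well.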

In order to resolve this difficulty, we use the fact that for an ergodic chain, providing the number of trials $n + 1$ is sufficiently large in a realization starting with the given state $E_i$, it is possible to end with any one of the states $E_1, \ldots, E_s$. In our particular case, assuming a sufficiently large number of trials, the realization of the chain leads to the sequence $S$ which starts with the fixed state $E_i$ and ends with the state $E_l$; a further trial may result in the sequence $S'$ with an additional state $E_r$, for all $r$ for which $p_{lr}(\theta) \neq 0$. The probability of this sequence is

$$\phi(S') = p_{ij}(\theta) \cdots p_{kl}(\theta)\,p_{lr}(\theta),$$

and the likelihood function $L' = \ln\phi(S')$ is identical with $L = \ln\phi(S)$ except that the transition frequency $n_{lr}$ in it is increased to $n_{lr} + 1$. This allows us to difference the equation (27) with respect to $n_{lr}$, for all $r$ for which $p_{lr} \neq 0$, in order to obtain

$$\ln p_{lr}(\theta) = A_{lr} + A_1(\theta)\,\Delta_{n_{lr}}g(T) + \Lambda_2(\theta),$$

where $\Delta_{n_{lr}}g(T)$ is some function of the transition frequencies $n_{ij}$. Now since

$$\{A_1(\theta)\}^{-1}\{\ln p_{lr}(\theta) - \Lambda_2(\theta) - A_{lr}\} = \Delta_{n_{lr}}g(T),$$

that is, a function of $\theta$ equals a function of the $n_{ij}$, then both must equal a constant $K_{lr}$. This allows us to write for all $r$ for which $p_{lr} \neq 0$,

$$p_{lr}(\theta) = \alpha_{lr}\exp[K_{lr}A_1(\theta) + \Lambda_2(\theta)],$$

where $A_{lr} = \ln\alpha_{lr}$. If we further admit zero values of $\alpha_{lr}$, we may accept this equation as the general form for both zero and non-zero values of $p_{lr}$ $(r = 1, 2, \ldots, s)$. Since, for a sufficiently large number of trials in a realization of the chain starting with the given state $E_i$, the final state $E_l$ may be such that $l$ can take any of the values $1, 2, \ldots, s$, it follows that we may write as the general form of the transition probabilities $p_{ij}(\theta)$ admitting a sufficient estimator $T$ of $\theta$, the equation

$$p_{ij}(\theta) = \alpha_{ij}\exp[K_{ij}A_1(\theta) + \Lambda_2(\theta)]. \tag{28}$$

This general form of the $p_{ij}(\theta)$ has been obtained for sufficiently large values of $n = \sum_{i,j}n_{ij}$; this means that in this case the necessary condition that a sufficient estimator of $\theta$ exists is that the $p_{ij}(\theta)$ be of form (28). It is easily verified that the sufficient condition also follows; that if the $p_{ij}(\theta)$ are of the form (28), the estimator of $\theta$ is sufficient. This, however, holds generally, irrespective of the size of $n$. It is of interest to note that if we restrict ourselves to the case where all the $p_{ij}(\theta)$ are non-zero, all values of $n \ge 1$ are sufficiently large for the necessary condition to hold, since any final state $E_l$ may be reached in a realization of one or more trials starting with the given state $E_i$. This means that for the ergodic chain with non-zero $p_{ij}(\theta)$, a sufficient estimator of $\theta$ exists for any number of trials $n \ge 1$, if and only if the $p_{ij}(\theta)$ are of the form (28).

The transition probabilities (28) are subject to the condition (1) that $\sum_{j=1}^{s}p_{ij} = 1$, so that

$$\sum_{j=1}^{s}\alpha_{ij}\exp[K_{ij}A_1(\theta)] = \exp[-\Lambda_2(\theta)],$$

or

$$\sum_{j=1}^{s}\alpha_{ij}\exp[K_{ij}A_1(\theta)] = \sum_{j=1}^{s}\alpha_{rj}\exp[K_{rj}A_1(\theta)] \qquad (i, r = 1, 2, \ldots, s). \tag{29}$$

If we write $\lambda_1(\theta) = \exp[A_1(\theta)]$, we see that the $p_{ij}(\theta)$ can also be put in the form

$$p_{ij}(\theta) = \alpha_{ij}\lambda_1^{K_{ij}}(\theta)\Big\{\sum_{j}\alpha_{ij}\lambda_1^{K_{ij}}(\theta)\Big\}^{-1}, \tag{30}$$

where $\{\sum_{j}\alpha_{ij}\lambda_1^{K_{ij}}(\theta)\}^{-1}$ has the same value for all rows $i = 1, 2, \ldots, s$.

Two possible cases now exist for those exponents among $K_{i1}, K_{i2}, \ldots, K_{is}$, associated with the non-zero probabilities among $p_{i1}, p_{i2}, \ldots, p_{is}$, in the $i$th row of the transition probability matrix $p$; the first in which all such exponents are distinct, and the second in which only some such exponents are distinct. For simplicity, we assume that in the $i$th row, all transition probabilities $p_{ij}(\theta)$ $(j = 1, 2, \ldots, s)$ are non-zero; the slightly more general case where some $p_{ij}(\theta)$ may be zero follows without difficulty.

Consider the first case, where for the particular row $i$ there are $s$ exponents $K_{ij}$, all distinct, associated with the $s$ non-zero transition probabilities $p_{ij}(\theta)$ $(j = 1, 2, \ldots, s)$; then, from equation (29), it follows that for any other row $r$,

$$\sum_{j}\alpha_{ij}\lambda_1^{K_{ij}} = \sum_{j}\alpha_{rj}\lambda_1^{K_{rj}}. \tag{31}$$

This is possible only if the exponents $K_{ij}$ and $K_{rj}$ $(j = 1, 2, \ldots, s)$ have the same $s$ distinct values, though these may be arranged in different orders. This means that there is only a single set of $s$ distinct values

$$\alpha_{i1}\lambda_1^{K_{i1}}\Big\{\sum_{j}\alpha_{ij}\lambda_1^{K_{ij}}\Big\}^{-1}, \ \ldots, \ \alpha_{is}\lambda_1^{K_{is}}\Big\{\sum_{j}\alpha_{ij}\lambda_1^{K_{ij}}\Big\}^{-1},$$

for the transition probabilities of the matrix $p$, and these must appear, possibly in different orders, in every row of the matrix.

Two simple examples for Markov chains with two and three states respectively will illustrate the previous points:

Example IV.3-1. The transition probability matrix of the form

$$p = \begin{pmatrix} \theta & 1 - \theta \\ 1 - \theta & \theta \end{pmatrix}$$

will always provide a sufficient estimator of $\theta$, for we may write $p_{1j}(\theta)$ $(j = 1, 2)$ in the form

$$p_{1j}(\theta) = \alpha_{1j}\exp[K_{1j}\ln\{\theta\,(1 - \theta)^{-1}\} + \ln(1 - \theta)],$$

where the values of $\alpha_{11}$, $\alpha_{12}$ and $K_{11}$, $K_{12}$ are 1, 1 and 1, 0 respectively. The second row consists of the same transition probabilities in a different order.











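Example IV.3-2.

$$p = \begin{pmatrix}
2\theta\,(2\theta + \theta^2 + \theta^3)^{-1} & \theta^2(2\theta + \theta^2 + \theta^3)^{-1} & \theta^3(2\theta + \theta^2 + \theta^3)^{-1} \\
\theta^3(2\theta + \theta^2 + \theta^3)^{-1} & 2\theta\,(2\theta + \theta^2 + \theta^3)^{-1} & \theta^2(2\theta + \theta^2 + \theta^3)^{-1} \\
\theta^2(2\theta + \theta^2 + \theta^3)^{-1} & \theta^3(2\theta + \theta^2 + \theta^3)^{-1} & 2\theta\,(2\theta + \theta^2 + \theta^3)^{-1}
\end{pmatrix}$$

also provides a sufficient estimator of $\theta$, for in (30) we have $\lambda_1(\theta) = \theta$, and the values of $\alpha_{1j}$ and $K_{1j}$ are 2, 1, 1 and 1, 2, 3 respectively. The second and third rows consist of the same values for the transition probabilities, but in different orders.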
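A quick numerical check of sufficiency for a matrix of the form (30): when $p_{ij} = \alpha_{ij}\lambda_1^{K_{ij}}/D$, the likelihood of a realization depends on $\theta$ only through $n$ and $\sum K_{ij}n_{ij}$, so two realizations of equal length with the same value of $\sum K_{ij}n_{ij}$ have a log-likelihood difference that is constant in $\theta$. The sketch below assumes the three-state matrix with rows built from $2\theta$, $\theta^2$, $\theta^3$ as read from Example IV.3-2 (the row ordering is an assumption where the print is unclear), and the two sequences are chosen here for illustration.

```python
import math


def transition_matrix(theta):
    """Rows are rearrangements of (2*theta, theta^2, theta^3) / D, D = 2*theta + theta^2 + theta^3."""
    d = 2 * theta + theta ** 2 + theta ** 3
    row = [2 * theta / d, theta ** 2 / d, theta ** 3 / d]
    return [row, [row[2], row[0], row[1]], [row[1], row[2], row[0]]]


def log_likelihood(seq, theta):
    """Log-likelihood of a state sequence, conditional on its fixed initial state."""
    p = transition_matrix(theta)
    return sum(math.log(p[a][b]) for a, b in zip(seq, seq[1:]))


# Both sequences start in state 0 and have two transitions.
# seq_a uses exponents K = 1 and 3; seq_b uses K = 2 and 2; the sums agree (4).
seq_a = (0, 0, 2)
seq_b = (0, 1, 2)
gaps = [log_likelihood(seq_a, th) - log_likelihood(seq_b, th) for th in (0.2, 0.7, 1.5)]
```

Under these assumptions every entry of `gaps` equals $\ln 2$, the ratio of the coefficients $\alpha_{ij}$ involved, whatever the value of $\theta$.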

Consider now the second case for the exponents. Here, for the particular row $i$, the $s$ exponents $K_{ij}$ are not all distinct; let the first $k$ exponents $(k < s)$ be identical

$$K_{i1} = K_{i2} = \cdots = K_{ik} = K,$$

and the remaining $K_{i,k+1}, \ldots, K_{is}$ be distinct, so that there are altogether $s - k + 1$ distinct values for the exponents, and the transition probabilities for row $i$ are therefore

$$\begin{aligned}
p_{ij}(\theta) &= \alpha_{ij}\lambda_1^{K}\Big\{\sum_{j=1}^{s}\alpha_{ij}\lambda_1^{K_{ij}}\Big\}^{-1} \qquad (j = 1, 2, \ldots, k), \\
p_{ij}(\theta) &= \alpha_{ij}\lambda_1^{K_{ij}}\Big\{\sum_{j=1}^{s}\alpha_{ij}\lambda_1^{K_{ij}}\Big\}^{-1} \qquad (j = k + 1, \ldots, s).
\end{aligned} \tag{32}$$

(The slightly more general case of several groups of exponents with identical values, and the remaining exponents distinct

$$\begin{aligned}
K_{i1} &= K_{i2} = \cdots = K_{ik} = K', \\
K_{i,k+1} &= K_{i,k+2} = \cdots = K_{im} = K'', \\
K_{i,m+1} &\neq K_{i,m+2} \neq \cdots \neq K_{is},
\end{aligned}$$

presents no difficulty different from those met in the simpler case mentioned, and will not be considered.) Now since for any row $r$ equation (31) holds, there are $s - k + 1$ distinct values of $K_{rj}$ which are identical with the distinct values of the $K_{ij}$, so that in every row of the stochastic matrix $p$ we have, apart from coefficients, transition probabilities of only the following $s - k + 1$ distinct forms

$$\lambda_1^{K}\Big\{\sum_{j}\alpha_{ij}\lambda_1^{K_{ij}}\Big\}^{-1}, \quad \lambda_1^{K_{i,k+1}}\Big\{\sum_{j}\alpha_{ij}\lambda_1^{K_{ij}}\Big\}^{-1}, \quad \ldots, \quad \lambda_1^{K_{is}}\Big\{\sum_{j}\alpha_{ij}\lambda_1^{K_{ij}}\Big\}^{-1}, \tag{33}$$

appearing in various arrangements. The $s$ coefficients associated with these $s - k + 1$ distinct forms for the transition probabilities may also differ from row to row, but are subject to the condition which follows from (31) that for all rows $r = 1, 2, \ldots, s$, the sums of the coefficients for each distinct form in (33) must equal the fixed values

$$\sum_{j=1}^{k}\alpha_{ij}, \quad \alpha_{i,k+1}, \quad \ldots, \quad \alpha_{is} \tag{34}$$

respectively.

Two simple examples of this case for Markov chains with three states follow.
Example IV. 3-3. 
20(36 + 68)-4 6(30+64)-!  63(30 + 6) 
p= | 405(30+ 68)" -263(30+ 65) = 30(30+ a) : 
130(30+ 6) = 6(30+65)-1 40(30+ 68)" 













where in equation (32), $\lambda_1(\theta) = \theta$, and the two distinct forms for the transition probabilities appearing in each row are $\theta\,(3\theta + \theta^3)^{-1}$, $\theta^3(3\theta + \theta^3)^{-1}$.

The values of the coefficients for the form $\theta\,(3\theta + \theta^3)^{-1}$ vary from row to row, subject to the condition (34) that their sum is always equal to 3, while those for the form $\theta^3(3\theta + \theta^3)^{-1}$ have a sum always equal to 1.
Example IV. 3-4. 


p=| 30 30 «1-0 
(1-9) 3(1-0) @ 


where in equation (32), $A_i(\theta) = \theta(1-\theta)^{-1}$, and the two distinct forms for the transition
probabilities appearing in each row are

$$\{\theta(1-\theta)^{-1}\}^{K_{ij}}(1-\theta), \qquad K_{ij} = 0, 1.$$

Again the values of the coefficients for $\theta$ vary from row to row, subject to condition (34) that
their sum is always equal to 1, while those for $(1-\theta)$ also have a sum equal to 1.


40 1-6 “ 


I am greatly indebted to Prof. P. A. P. Moran for suggesting the field of parameter estima- 
tion in simple Markov chains as a subject for research, and for his useful critical comments 
throughout all stages of the work. I should also like to thank the referee for several helpful 
suggestions. 


REFERENCES 


Bartlett, M. S. (1951). The frequency goodness of fit test for probability chains. Proc. Camb. Phil.
Soc. 47, 86-95.

Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press.

Frazer, R. A., Duncan, W. J. & Collar, A. R. (1947). Elementary Matrices. New York: Macmillan.

Fréchet, M. (1952). Traité du Calcul des Probabilités et de ses Applications, tome 2, fasc. 11. Gauthier-
Villars.

Huzurbazar, V. S. (1948). The likelihood equation, consistency and maxima of the likelihood
function. Ann. Eugen., Lond., 14, 185-200.

Koopman, B. O. (1936). On distributions admitting a sufficient statistic. Trans. Amer. Math. Soc.
39, 399-409.

Montroll, E. W. (1947). On the theory of Markoff chains. Ann. Math. Statist. 18, 18-36.

Pitman, E. J. G. (1936). Sufficient statistics and intrinsic accuracy. Proc. Camb. Phil. Soc. 32, 567-79.

Rao, C. R. (1952). Advanced Statistical Methods in Biometric Research. New York: John Wiley.













[ 360 ] 


SIGNIFICANCE TESTS FOR DISCRIMINANT FUNCTIONS AND 
LINEAR FUNCTIONAL RELATIONSHIPS 


By E. J. WILLIAMS 
Division of Mathematical Statistics, C.S.I.R.O., Melbourne 


I. INTRODUCTION


In previous papers (Bartlett, 1951; Williams, 1952a) certain exact tests for the adequacy
of a hypothetical discriminant function were derived. Later papers (Williams, 1952b,
1952c, 1953) showed how these tests could be applied in a number of situations of practical
usefulness. The first object of the present paper is to extend the work in the above-mentioned
papers and to show how the results obtained may be interpreted in terms of multiple
linear regression. The calculations may indeed be carried out in the manner of a covariance
analysis. The second object is to develop, along the same lines, exact tests for an assumed
linear relationship among variables; this is a problem which has been discussed in various
contexts by Koopmans (1937), Tintner (1945, 1946, 1950), Geary (1948, 1949), Bartlett
(1948), Anderson (1951) and others. Since the question of determining underlying relationships
has been given considerable attention in the literature from a number of different
points of view, the opportunity is taken also to discuss and to attempt to unify the different
approaches made: the use of information provided by instrumental variates, by grouping
of the data, and by higher moments.

Discriminant functions and functional relationships are discussed in the same paper
because the two problems are really different aspects of the same problem.
This has been well demonstrated by Geary (1948). If a single discriminant function is
assumed adequate to describe differences among a number of $p$-variate populations, this
assumption is equivalent to assuming that there exist $p - 1$ linear relations among the means
for the $p$ variates; the means then lie on a line. In general, postulating that the differences
among the populations are described by $r$ discriminant functions is equivalent to postulating
$p - r$ linear relationships (provided always that the number of populations considered is
not less than $p$). The quantity $r$, the number of dimensions in which the population means
lie, may be called the rank of the populations, and $p - r$ the degeneracy.

Thus the test for a single linear relationship is equivalent to the test for the adequacy of
$p - 1$ discriminant functions. In deriving significance tests for either a discriminant function
or a linear relationship, the same principle is applied, though the function being tested
has a different role in the two cases, and thus enters differently into the tests. In the simple
bivariate case, the test for a linear relationship is exactly the same as the test for the
discriminant function which is orthogonal to it.

The problems of this paper have been framed above in terms of an analysis of variance
model, for testing the significance of differences between populations. This has been done
in order to link them with those discussed in the earlier work (Williams, 1952a, b). A more
general specification is in terms of a regression model, wherein the interrelationships between
a set of $p$ variates and another set of $q$ variates are investigated. In such a model a
discriminant function is better described as a canonical variate. Throughout the remainder of

























E. J. WILLIAMS 361


this paper the problems will be posed in terms of the regression model but interpreted also 
in terms of analysis of variance. Though both specifications are formally equivalent it is 
easier sometimes to work with one and sometimes with the other. 

Bartlett (1951) has shown that the different aspects of the adequacy of a discriminant 
function may be tested by means of a factorization of a certain determinantal ratio. A general 
calculus of such factorizations is developed, and it is shown how it can be applied in a 
number of the tests described in this paper. 

The method of treatment used earlier (Williams, 1952a) in deriving exact significance tests 
was to frame the questions asked of the data somewhat differently from those usually posed. 
A hypothetical discriminant function being proposed, the question is whether this function 
is concordant with the data. Since the hypothetical discriminant function leads to a sufficient 
statistic for the unknown population parameter, tests can be derived which are independent 
of the value of the parameter. These tests are the counterparts of those based on the latent 
roots (largest or smallest) of a matrix, but have the advantage of being exact, even in small 
samples. The same approach is adopted in this paper, in deriving tests both for discriminant 
functions and functional relationships. 

The most important practical results given in the present paper are that the tests pre- 
viously given for the adequacy of a hypothetical discriminant function, and the tests now 
given for a functional relationship, can be reduced to tests derived by a covariance analysis 
of the data. The computations are thereby simplified because it is not necessary to deter- 
mine the latent roots of a matrix equation, the test functions being expressed either as 
ordinary adjusted sums of squares in a covariance analysis, or as the ratios of determinants 
of sums of squares and products. If, however, the ‘best’ discriminant function yielded by
the sample (or the ‘best’ linear relationship) is to be evaluated, it is still necessary to 
evaluate the largest (smallest) latent root and the corresponding latent vector. 


II. NOTATION, TERMINOLOGY AND PRELIMINARY RESULTS


Since this paper covers a rather wide field it has been thought desirable to set out clearly 
the notation to be used throughout. While an attempt has been made to use notation in 
conformity with previous work on the subject, some departures have been made in the 
interests of clarity. 

We consider two sets of variates, 


$$X_i \quad (i = 1, 2, \ldots, p),$$
$$Y_j \quad (j = 1, 2, \ldots, q),$$


measured on each of $n + 1$ individuals. The $X_i$ will be assumed to have a non-singular joint
normal distribution, with linear regression on the $Y_j$. Formally, this is equivalent to $q + 1$
populations with the same $p$-variate normal distribution of values about the means in
each, so that the variation in the sample may be analysed into the $q$ degrees of freedom
between groups and the $n - q$ within groups.

The discrimination problem is to determine from a set of data, either that linear function
or set of linear functions of the $X_i$ which have greatest correlation with the $Y_j$, or the adequacy
of a set of given discriminant functions. On the other hand, the problem of determining
functional relationships is to find the set of linear functions, if any, of the $X_i$ which are
uncorrelated with the $Y_j$, or to test a given set of such functions for their correlation with













362 Tests for discriminant functions and linear functional relationships 


the $Y_j$. Linear functions which are uncorrelated with the $Y_j$, either by hypothesis or in the
sample, will be called null functions, and their sample values null variates.

Although from the formal correlation point of view there is a symmetry between the
$X_i$ and the $Y_j$, the tests of significance developed here are always of a function (either a
discriminant function or a null function) of the variables of one of the sets (in this paper
the $X_i$), and so are not symmetrical in the two sets. In this respect the present approach
differs from that based on the canonical correlations alone, in which the magnitudes of the
canonical correlations are used to decide the number of discriminant functions (or, in other
words, the rank and degeneracy of the population correlations). Reiersøl (1945) emphasizes
the distinction between the two sets of variates by designating the $Y_j$ ‘instrumental’
variates to distinguish them from the ‘investigational’ variates $X_i$.

The canonical variates of the two sets will be denoted by $x_i$, $y_j$, and the corresponding
squared canonical correlations by $\theta_i$. The hypothetical discriminant function will be denoted
by $\xi$, and its squared multiple correlation with the $Y_j$, or discriminant ratio, by $\lambda$.

The tests of significance for a discriminant function have been shown in earlier papers
to depend only on the original discriminant ratios $\theta_i$ and the new discriminant ratios after
the effect of $\xi$ has been eliminated by covariance, which will be denoted by $\phi_j$.

It will be shown similarly in this paper that the tests for a linear functional relationship, 
that is, for a given null function, depend on sets of original and new latent roots; the new 
latent roots which will be denoted by 4; are, however. defined differently from those used 
in discriminant analysis. 

The following notation will be used: 

Sums of squares and products: 


$t_{ij}$    sum of products of $X_i$ and $X_j$

$p_{ij}$    sum of products of $X_i$ and $Y_j$

$u_{ij}$    sum of products of $Y_i$ and $Y_j$

$b_{ij}$    sum of products of $X_i$ and $X_j$ for regression or between-group line of
analysis of variance

$w_{ij}$    sum of products of $X_i$ and $X_j$ for residuals from regression or within
groups.

Also $t_{\xi i}$    sum of products of $\xi$ and $X_i$

$t_{\xi\xi}$    sum of squares of $\xi$


and similarly for other sums of squares and products with the suffix $\xi$.
The corresponding matrices will be denoted by capitals. Then the following results are
readily obtained, relating the different expressions occurring in the tests of significance:

$$B + W = T,$$
$$B = PU^{-1}P', \qquad |B|/|T| = \prod_i \theta_i,$$
$$W = T - PU^{-1}P', \qquad |W|/|T| = \prod_i (1-\theta_i).$$

The matrix of sums of squares and products of all the $p+q$ variates will be denoted by $S$:

$$S = \begin{pmatrix} T & P \\ P' & U \end{pmatrix}.$$

Armed with these results, we are in a position to set out expressions in terms either of the
regression model or the analysis of variance model.
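
The relations just stated may be checked numerically. The following is a minimal sketch, assuming simulated data and NumPy; the function name `sscp_matrices`, the sample sizes and the random seed are illustrative choices and not part of the original treatment. It forms $T$, $P$, $U$, $B$ and $W$ from centred data and computes the determinantal ratios together with the latent roots $\theta_i$ of $T^{-1}B$.

```python
import numpy as np

def sscp_matrices(X, Y):
    """Sums of squares and products about the means: T, P, U, B, W."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    T = Xc.T @ Xc                       # total, for the X variates
    P = Xc.T @ Yc                       # products of X with Y
    U = Yc.T @ Yc                       # total, for the Y variates
    B = P @ np.linalg.solve(U, P.T)     # regression part, P U^{-1} P'
    W = T - B                           # residual part
    return T, P, U, B, W

# Hypothetical data: n + 1 = 30 individuals, p = 2, q = 3.
rng = np.random.default_rng(0)
Y = rng.normal(size=(30, 3))
X = Y @ rng.normal(size=(3, 2)) + rng.normal(size=(30, 2))

T, P, U, B, W = sscp_matrices(X, Y)
theta = np.linalg.eigvals(np.linalg.solve(T, B)).real   # latent roots of T^{-1}B
S = np.block([[T, P], [P.T, U]])

ratio_w = np.linalg.det(W) / np.linalg.det(T)
ratio_b = np.linalg.det(B) / np.linalg.det(T)
ratio_s = np.linalg.det(S) / (np.linalg.det(T) * np.linalg.det(U))
```

Under these assumptions `ratio_w` should agree with the product of the $(1-\theta_i)$ and with `ratio_s`, and `ratio_b` with the product of the $\theta_i$, up to rounding error.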
















In the derivation of tests of significance by means of covariance analysis, use is made of
sums of squares and products of the $X_i$ after adjustment by certain covariance variates.
Since there seems to be no consistent terminology for these quantities, we propose to use
the term ‘adjusted’ only for quantities which are adjusted without any loss of degrees of
freedom, and the term ‘reduced’ for quantities for which the degrees of freedom are reduced
by the number of covariance variates. Thus, the analysis of variance of any $X_i$ would give


the following partition of degrees of freedom:

                              Degrees of freedom
    Regression                q
    Residual                  n - q
    Total                     n

After adjustment by r covariance variates, the analysis would give

                              Degrees of freedom
    Regression, adjusted      q
    Residual, reduced         n - q - r
    Total, reduced            n - r

The adjusted regression sum of squares may, if required, be partitioned as follows:

                                                          Degrees of freedom
    Difference of regressions on covariance variates      r
    Regression, reduced                                   q - r
    Regression, adjusted                                  q





III. THE FACTORIZATION OF DETERMINANTAL RATIOS


The overall likelihood criterion for testing the rank of the population from which a sample
has been drawn is, in terms of the sample latent roots, equal to

$$\prod_{i=1}^{p} (1-\theta_i).$$













In terms of determinants of sums of squares and products it may also be expressed as

$$|W|/|T| = |T - PU^{-1}P'|/|T| = \frac{|S|}{|T|\,|U|}, \tag{1}$$

and so may be represented as a determinantal ratio without calculation of the latent roots.
The criterion may be expressed as the ratio of two determinants of order $p$, with $n - q$ and
$n$ degrees of freedom respectively; and since the $X_i$ and the $Y_j$ enter symmetrically into the
expression, it may also be expressed as the ratio of two determinants of order $q$, with degrees
of freedom $n - p$ and $n$. It will accordingly be denoted

$$(n;\, p, q)$$

to indicate its dependence on the three sets of degrees of freedom. In particular, for $p = 1$,
the ratio of a residual sum of squares with $n - q$ degrees of freedom to a total sum of squares
with $n$ degrees of freedom is denoted $(n;\, 1, q)$.


Now if $s$ is an integer less than $q$, a sum of squares with $q$ degrees of freedom can be
partitioned into two independent sums of squares with $s$ and $q - s$ degrees of freedom.
Corresponding to this partition, the ratio of sums of squares with $n - q$ and $n$ degrees of freedom
can be factorized into two independent factors: one with $n - s$ and $n$ degrees of freedom,
the other with $n - q$ and $n - s$ degrees of freedom; so that

$$(n;\, 1, q) = (n;\, 1, s)\,(n-s;\, 1, q-s). \tag{2}$$

In exactly the same way, determinantal ratios may be factorized, giving, for example,

$$(n;\, p, q) = (n;\, p, s)\,(n-s;\, p, q-s), \tag{3}$$

or, more generally,

$$(n;\, p, q) = (n;\, p, s)\,(n-s;\, r, q-s)\,(n-r-s;\, p-r, q-s). \tag{4}$$


These results are the basis of the factorizations of likelihood criteria given by Bartlett 
(1951), and will be applied repeatedly throughout this paper. 
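
The factorization (3) can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than part of the original argument: the data are simulated, and the helper names `wilks` and `residual` are hypothetical. The partial factor $(n-s;\, p, q-s)$ is computed from the residuals of both sets of variates after regression on the first $s$ of the $Y_j$.

```python
import numpy as np

def wilks(X, Y):
    """|W|/|T| for centred X regressed on centred Y, i.e. the criterion (n; p, q)."""
    T = X.T @ X
    P = X.T @ Y
    W = T - P @ np.linalg.solve(Y.T @ Y, P.T)
    return np.linalg.det(W) / np.linalg.det(T)

def residual(A, Z):
    """Residuals of the columns of A after least-squares regression on Z."""
    coef, *_ = np.linalg.lstsq(Z, A, rcond=None)
    return A - Z @ coef

# Hypothetical data: n + 1 = 40 individuals, p = 3, q = 4, s = 2.
rng = np.random.default_rng(1)
n1, p, q, s = 40, 3, 4, 2
Y = rng.normal(size=(n1, q))
X = Y @ rng.normal(size=(q, p)) + rng.normal(size=(n1, p))
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

Y1, Y2 = Y[:, :s], Y[:, s:]
whole = wilks(X, Y)                                      # (n; p, q)
simple = wilks(X, Y1)                                    # (n; p, s)
partial = wilks(residual(X, Y1), residual(Y2, Y1))       # (n - s; p, q - s)
```

On these assumptions `whole` should equal `simple * partial` to within rounding error, which is the content of (3).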

As in univariate analysis, the first factor in (3) is seen to be the likelihood criterion for
the simple effect of the $s$ variables, the remaining $q - s$ being ignored, while the second factor
corresponds to the partial effect of the $q - s$ variables after the elimination of the first $s$.
The partial factor is accordingly the one which is to be used in significance tests. Just as in
univariate analysis, there are always two alternative factorizations, depending on which
set out of the $q$ variables is to be eliminated and which is to be the subject of test. Thus

$$(n;\, p, q) = (n;\, p, s)\,(n-s;\, p, q-s)$$
$$\qquad\quad = (n-q+s;\, p, s)\,(n;\, p, q-s). \tag{5}$$

When $p = 1$, the tests for the partial factors may be simultaneously set out in the form of
an analysis of variance:

    Effect                      Degrees of freedom    Sum of squares
    Partial s variates          s                     1 - (n; 1, s)
    Partial q - s variates      q - s                 1 - (n; 1, q - s)
    Residual                    n - q                 (n; 1, q)
    Total                       n
| 
















The possibility of thus arranging the tests arises from the fact that (for example) 


$$\frac{1-(n;\, 1, s)}{(n;\, 1, q)} = \frac{1-(n-q+s;\, 1, s)}{(n-q+s;\, 1, s)} = \frac{s}{n-q}\,F. \tag{6}$$


The moments of the likelihood criterion may be determined by means of this factorization.
Taking $s = 1$ and factorizing successively, we have

$$(n;\, p, q) = (n;\, p, 1)\,(n-1;\, p, 1)\,(n-2;\, p, 1) \cdots (n-q+1;\, p, 1).$$

It is readily shown that the $t$th moment about zero of $(n;\, p, 1)$ is

$$E\{(n;\, p, 1)^t\} = \frac{\Gamma[\tfrac12(n-p)+t]\;\Gamma(\tfrac12 n)}{\Gamma[\tfrac12(n-p)]\;\Gamma(\tfrac12 n+t)},$$

so that the moments of the likelihood criterion are given by

$$E\{(n;\, p, q)^t\} = \prod_{i=0}^{q-1} \frac{\Gamma[\tfrac12(n-p-i)+t]\;\Gamma[\tfrac12(n-i)]}{\Gamma[\tfrac12(n-p-i)]\;\Gamma[\tfrac12(n-i)+t]} \tag{7}$$

(cf. Bartlett (1938), equation (26)).
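
As a hedged illustration, the moment formula (7) may be evaluated with log-gamma functions. The function name `criterion_moment` is a hypothetical label introduced here; the code simply multiplies the $q$ single-variable factors obtained from the successive factorization, assuming the product form given above.

```python
from math import exp, lgamma

def criterion_moment(n, p, q, t):
    """t-th moment about zero of (n; p, q), as the product of the moments
    of the q factors (n - i; p, 1), i = 0, ..., q - 1."""
    log_moment = 0.0
    for i in range(q):
        a = 0.5 * (n - p - i)   # half the residual degrees of freedom of factor i
        b = 0.5 * (n - i)       # half the total degrees of freedom of factor i
        log_moment += lgamma(a + t) - lgamma(a) + lgamma(b) - lgamma(b + t)
    return exp(log_moment)

first_moment = criterion_moment(10, 1, 1, 1)   # mean of a ratio with 9 and 10 d.f.
```

For $p = q = 1$ and $t = 1$ the expression reduces to $(n-1)/n$, the mean of the ratio of a residual to a total sum of squares, and the value is unchanged when $p$ and $q$ are interchanged, in keeping with the symmetry noted for (1).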
An important particular case is the limiting one when $n$ tends to infinity. Writing

$$(p, q) = \lim_{n \to \infty} (n;\, p, q)^{n},$$

we find that the factorization (3) becomes

$$(p, q) = (p, s)\,(p, q-s), \tag{8}$$

and it can be shown that $-\log\,(p, q)$ is distributed as a sum of squares with $pq$ degrees of
freedom (cf. Williams, 1952a, p. 26).

When n tends to infinity, the two factorizations given above are identical, and are equi- 
valent to a simple partition of a sum of squares—as is, indeed, evident from the analysis 
set out above for the particular case p = 1. 

An application. As a simple example of the application of these methods we consider the 
case where r of the population latent roots are known to be unity, and the vanishing of 
the remaining population roots is under test. 

If $r$ population roots are unity, it follows that $\theta_1, \theta_2, \ldots, \theta_r$ are all unity, and that the
corresponding canonical variates $x_1, x_2, \ldots, x_r$ are equal (apart from possible changes of
sign) to $y_1, y_2, \ldots, y_r$, the canonical variates of the second set. Therefore, in order to examine
the residual variation of the $x_i$, the $p$ variates of the first set may be replaced by any $p - r$
of them not linearly dependent on the first $r$ of them. Each of these variates has $n - q$
residual degrees of freedom. For their regression on the $y_j$, the variates are restricted by
the condition of orthogonality with the first $r$ of the $x_i$ and hence with the first $r$ of the $y_j$, so that the
regression has but $q - r$ degrees of freedom. Accordingly, the residual likelihood criterion
for testing the existence of the non-unit latent roots, namely,

$$\prod_{i=r+1}^{p} (1-\theta_i) \tag{9}$$













is a $(n-r;\, p-r, q-r)$ variable. In particular, if $r = p-1$ $(p < q)$, the single sample latent
root $\theta_p$ may be tested by an F-test:

$$F = \frac{(n-q)\,\theta_p}{(q-p+1)(1-\theta_p)}, \tag{10}$$

with $q-p+1$ and $n-q$ degrees of freedom.
Corresponding results hold when $q < p$.
These results have practical application where, even if the first $r$ population canonical
correlations are not unity, the sample latent roots are close to unity. Then the elimination of
the corresponding sample canonical variates leads to a residual likelihood criterion (9)
with approximately the distribution given. Such a result has already been discussed by
Bartlett (1947) and others.
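
The F-test (10) is simple enough to set out as a short routine. The following is a minimal sketch; the function name `last_root_f` and the numerical values in the usage line are hypothetical, and the routine assumes the case $p < q$ with the remaining $p - 1$ roots taken as unity.

```python
def last_root_f(theta_p, n, p, q):
    """F statistic and degrees of freedom for the single remaining latent root,
    following (10); valid for p < q and 0 <= theta_p < 1."""
    if not (p < q and 0.0 <= theta_p < 1.0):
        raise ValueError("requires p < q and 0 <= theta_p < 1")
    df1 = q - p + 1          # regression degrees of freedom
    df2 = n - q              # residual degrees of freedom
    f = df2 * theta_p / (df1 * (1.0 - theta_p))
    return f, (df1, df2)

# Illustrative values only: n = 20, p = 2, q = 3, theta_p = 0.5.
f_value, dfs = last_root_f(0.5, 20, 2, 3)
```

The statistic would then be referred to the F distribution with the returned degrees of freedom.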


IV. REVIEW AND EXTENSION OF PREVIOUS WORK 


Bartlett (1951) derived tests of significance for the goodness of fit of a single hypothetical 
discriminant function, in terms of its direction, and of departures of the data from col- 
linearity. These results generalized some results of Williams (1952a) for the particular case 
p = 2 (two variates) and q = 2 (two degrees of freedom between groups). However, two 
points were not clarified in this work. First, while Williams used for test functions the 
partial factors, Bartlett suggested that the use of partial factors was not necessary in these 
tests. It should be clear from the discussion in § III that the partial factors are appropriate 
for significance tests. 

Secondly, it was stated by Bartlett that the general test functions could be expressed 
in terms of the original latent roots and the new latent roots after adjustment for the 
hypothetical discriminant function, but the expressions were not given. These results, 
which are given below, show the relationship of the exact tests to the approximate tests 
based on the individual ordered latent roots. Apart from this, they are of more analytic 
than practical interest, since the calculation of all the latent roots, before and after adjust- 
ment, is laborious. Moreover, in §V of this paper, simplified computational procedures 
based on the analysis of covariance will be given, which make possible the ready application 
of the exact tests without the calculation of latent roots. 


Expression of test functions in terms of canonical correlations 


For this section, we express all results in terms of the canonical variates $x_i$, $y_j$ and the
canonical correlations. Let the hypothetical discriminant function which is under test be

$$\xi = \sum_i m_i x_i, \quad \text{with} \quad \sum_i m_i^2 = 1,$$

so that the corresponding discriminant ratio is

$$\lambda = \sum_i m_i^2 \theta_i.$$

The first factorization of the overall likelihood criterion, representing the elimination of
the effect of the hypothetical discriminant function, is

$$\prod_i (1-\theta_i) = (1-\lambda) \prod_j (1-\phi_j),$$

corresponding to $(n;\, p, q) = (n;\, 1, q)\,(n-1;\, p-1, q)$, and leading by the elimination of the
factor $1-\lambda$ for the hypothetical discriminant function to the residual likelihood criterion
$\prod_j (1-\phi_j)$.


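Bartlett (1951) gives two factorizations of the residual likelihood criterion, providing
tests for simple direction and partial collinearity (equation (7) of his paper) and for partial
direction and simple collinearity (equation (20) of his paper). In our notation, these
factorizations may be written

$$\prod_j (1-\phi_j) = \prod_i (1-\theta_i)/(1-\lambda), \quad \text{a } (n-1;\, p-1, q) \text{ variable},$$

$$= \left[\frac{\lambda - \sum_i m_i^2\theta_i^2}{\lambda(1-\lambda)}\right] \left[\frac{\lambda \prod_i (1-\theta_i)}{\lambda - \sum_i m_i^2\theta_i^2}\right], \tag{11}$$

in which the first factor, for simple direction, is a $(n-1;\, p-1, 1)$ variable and the second, for
partial collinearity, a $(n-2;\, p-1, q-1)$ variable; and

$$= \left[\frac{\lambda}{(1-\lambda)\sum_i m_i^2\theta_i/(1-\theta_i)}\right] \left[\frac{\prod_i (1-\theta_i)}{\lambda} \sum_i \frac{m_i^2\theta_i}{1-\theta_i}\right], \tag{12}$$

in which the first factor, for partial direction, is a $(n-q;\, p-1, 1)$ variable and the second, for
simple collinearity, a $(n-1;\, p-1, q-1)$ variable.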


The degrees of freedom for these factors agree with the values given by Bartlett. Since 
the first factor of (12) has unity for one of its degrees of freedom, it may be tested by an 
ordinary F-test. 

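It now remains to express the $m_i^2$ in terms of the original and new latent roots. From
Williams (1952a, p. 20) we have, in the present notation, the equation for the $\phi_j$

$$\sum_i \frac{m_i^2 (1-\theta_i)}{\theta_i - \phi_j} = 0 \quad (j = 1, 2, \ldots, p-1). \tag{13}$$

On expressing this as an equation of degree $p - 1$ in $\phi$, we have

$$\phi^{p-1} \sum_i m_i^2 (1-\theta_i) - \phi^{p-2} \sum_i m_i^2 (1-\theta_i)\Big(\sum_k \theta_k - \theta_i\Big) + \ldots = 0,$$

i.e.

$$\phi^{p-1}(1-\lambda) - \phi^{p-2}\Big[(1-\lambda)\sum_i \theta_i - \lambda + \sum_i m_i^2\theta_i^2\Big] + \ldots = 0.$$

Hence

$$\sum_j \phi_j = \sum_i \theta_i - \Big(\lambda - \sum_i m_i^2\theta_i^2\Big)\Big/(1-\lambda),$$

so that

$$\Big(\lambda - \sum_i m_i^2\theta_i^2\Big)\Big/(1-\lambda) = \sum_i \theta_i - \sum_j \phi_j;$$

and the factors (11) become

$$\left[\frac{\sum_i \theta_i - \sum_j \phi_j}{\lambda}\right] \left[\frac{\lambda \prod_j (1-\phi_j)}{\sum_i \theta_i - \sum_j \phi_j}\right]. \tag{11'}$$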


The second factor provides the test for collinearity. It takes its maximum value
$\prod_{i=2}^{p} (1-\theta_i)$ when $\lambda = \theta_1$. The use of this product therefore overestimates the significance of departures
from collinearity. When $p = 2$ it reduces to


(1 —4,) (1-42) 





(1-0 a ) = [1+v,/(1—v,)(1—v2)}" (notation of Williams, 1952a), 
(1—4;) (1-42) | 
1-A “Ss 


and is thus a function of the variance ratio previously given for testing collinearity. 













When q = 2 the second factor of (11) becomes 





(1—0,) (1—4,) : a 
=1-Q/P (notation of Williams, 1952a 
9, +6,—9,— $y e/ ( 


= [1+2,/(1—2,)(1—»,)}", 


which again corresponds to the test function given in Williams (1952a, p. 26). 
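
The relation between the original roots $\theta_i$, the coefficients $m_i$ and the new roots $\phi_j$ can be checked numerically. The sketch below is offered under stated assumptions: equation (13) is taken in the form $\sum_i m_i^2(1-\theta_i)/(\theta_i-\phi_j) = 0$, the function name `new_roots` is hypothetical, and the numerical values of $\theta_i$ and $m_i$ are illustrative only.

```python
import numpy as np

def new_roots(theta, m):
    """Roots phi_j of sum_i m_i^2 (1 - theta_i) / (theta_i - phi) = 0,
    found from the equivalent polynomial of degree p - 1 in phi."""
    theta = np.asarray(theta, dtype=float)
    m2 = np.asarray(m, dtype=float) ** 2
    m2 = m2 / m2.sum()                      # enforce sum of m_i^2 = 1
    poly = np.zeros(len(theta))
    for i in range(len(theta)):
        others = np.delete(theta, i)
        # np.poly gives the monic polynomial with the remaining theta_k as roots
        poly = poly + m2[i] * (1.0 - theta[i]) * np.poly(others)
    return np.sort(np.roots(poly).real)

theta = np.array([0.8, 0.5, 0.2])           # illustrative original roots
m = np.array([0.6, 0.64, 0.48])             # illustrative coefficients, sum of squares 1
m2 = m ** 2
lam = float(m2 @ theta)                     # discriminant ratio of xi
phi = new_roots(theta, m)

sum_check = theta.sum() - (lam - m2 @ theta**2) / (1.0 - lam)
prod_check = np.prod(1.0 - theta) / (1.0 - lam)
```

On these assumptions `phi.sum()` should agree with `sum_check`, and the product of the $(1-\phi_j)$ with `prod_check`, which is the residual likelihood criterion.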
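For the alternative factorization it is convenient to write

$$\Theta = \theta/(1-\theta),$$
$$\Phi = \phi/(1-\phi),$$
$$\Lambda = \lambda/(1-\lambda).$$

Then equation (13) becomes

$$\sum_i \frac{m_i^2}{\Theta_i - \Phi_j} = 0 \quad (j = 1, 2, \ldots, p-1),$$

so that

$$\Phi^{p-1} - \Phi^{p-2} \sum_i m_i^2 \Big(\sum_k \Theta_k - \Theta_i\Big) + \ldots = 0.$$

Hence

$$\sum_i m_i^2 \Theta_i = \sum_i \Theta_i - \sum_j \Phi_j,$$

and the factors (12) become

$$\left[\frac{\Lambda}{\sum_i \Theta_i - \sum_j \Phi_j}\right] \left[\frac{\prod_j (1-\phi_j)}{\Lambda} \Big(\sum_i \Theta_i - \sum_j \Phi_j\Big)\right]. \tag{12'}$$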
j d i j 


The first factor provides the test for direction; being a $(n-q;\, p-1, 1)$ variable it may be
tested by means of an F-test with $p-1$ and $n-p-q+1$ degrees of freedom:

$$F = \frac{n-p-q+1}{p-1} \cdot \frac{1-(n-q;\, p-1, 1)}{(n-q;\, p-1, 1)}.$$

When $\lambda$ is close to $\theta_1$, so that the hypothetical discriminant function differs little in direction
from the first canonical variate, the first factor of (12) is approximately $\lambda/\theta_1$. This provides
a justification of the approximate F-test

$$F = \frac{n-p-q+1}{p-1} \cdot \frac{\theta_1-\lambda}{\lambda}.$$

When $p = 2$ the first factor becomes

$$\frac{(1-\theta_1)(1-\theta_2)}{(1-\lambda)(1-\theta_1\theta_2/\lambda)},$$

the test function previously given.
When q = 2 the first factor reduces to 


(1—v,)(1—vs,), 


again equivalent to the test function given previously. 


















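The two factorizations may be shown in closer relationship by means of the following
representations, which make use of the fact that $(1-\phi_j) = (1+\Phi_j)^{-1}$:

$$\prod_j (1-\phi_j) = \left[\frac{\sum_i \theta_i - \sum_j \phi_j}{\lambda}\right] \left[\frac{\lambda \prod_j (1-\phi_j)}{\sum_i \theta_i - \sum_j \phi_j}\right],$$

$$\prod_j (1+\Phi_j) = \left[\frac{\sum_i \Theta_i - \sum_j \Phi_j}{\Lambda}\right] \left[\frac{\Lambda \prod_j (1+\Phi_j)}{\sum_i \Theta_i - \sum_j \Phi_j}\right].$$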


V. DERIVATION OF TESTS FOR DIRECTION AND COLLINEARITY 
BY COVARIANCE ANALYSIS 


It has been shown by Bartlett (1939) that a general test for the adequacy of a discriminant 
function is provided by the residual likelihood criterion after the elimination by covariance 
of the effect of the function. A point that has been overlooked, however, is that by a simple 
extension of this method the separate tests for direction and collinearity of the function may 
be derived. Bartlett there deals with the simple case of discrimination between two groups 
(q = 1). In the treatment below we shall consider the regression model, in the general case 
with q independent (or instrumental) variates. 


(a) The case p = 2 
To introduce the method, we consider first two variates $X_1$ and $X_2$, and their correlation
with the $q$ variates $Y_1, Y_2, \ldots, Y_q$. Let the hypothetical discriminant function be $\xi$, a given
linear function of $X_1$ and $X_2$.

To test the adequacy of $\xi$, we consider the residual correlation of the $X_i$ with the $Y_j$ after
the elimination of $\xi$ by covariance. We may adjust either $X_1$, $X_2$ or any linear function of
them, for its dependence on $\xi$; the same analysis is achieved in any case, since $\xi$ is itself a
linear function of $X_1$ and $X_2$.

For the overall test, the sum of squares, with $q$ degrees of freedom, for regression on the
$Y_j$ after adjustment for $\xi$, is tested against the reduced residual sum of squares, with $n-q-1$
degrees of freedom. Now the adjusted regression sum of squares may be separated into two
parts: (i) one with $q-1$ degrees of freedom, namely, the sum of squares for regression,
reduced by the elimination of the effect of $\xi$, and (ii) one with 1 degree of freedom,
representing the regression of the regression values on $\xi$.

In the analysis of variance model with q degrees of freedom between groups, the two 
parts of the adjusted sum of squares between groups are (i) the reduced sum of squares 
between groups, and (ii) the sum of squares for the difference of between-group and within- 
group regressions. 

Then it turns out that the reduced sum of squares (i) gives the sum of squares for testing
the independence of the adjusted $X_i$ and $Y_j$ (or the collinearity of populations in the analysis
of variance model), while the sum of squares (ii) provides the test for direction of the
proposed discriminant function. The tests are equivalent to those given in the formal analysis
of variance presented previously (Williams, 1952a, p. 24).















This equivalence may be established as follows. 
With the notation given in §II, we can express the determinantal ratios in terms of
symmetric functions of the latent roots (discriminant ratios). Thus

$$|B|/|T| = \theta_1\theta_2,$$
$$|W|/|T| = (1-\theta_1)(1-\theta_2),$$

while, by definition, $b_{\xi\xi}/t_{\xi\xi} = \lambda$.

Then the sums of squares for $X_1$, after adjustment by covariance for $\xi$, may be expressed
in terms of determinantal ratios or latent roots as follows:

                                 Degrees of    Sums of squares in terms of
                                 freedom       Determinantal ratios                                                  Latent roots

    Difference of regressions    1             $\beta^2(|T|/t_{\xi\xi} - |B|/b_{\xi\xi} - |W|/w_{\xi\xi})$           $\dfrac{(\theta_1-\lambda)(\lambda-\theta_2)}{\lambda(1-\lambda)} \cdot \dfrac{\beta^2|T|}{t_{\xi\xi}}$
    (direction)

    Between-groups reduced       q - 1         $\beta^2|B|/b_{\xi\xi}$                                               $\dfrac{\theta_1\theta_2}{\lambda} \cdot \dfrac{\beta^2|T|}{t_{\xi\xi}}$
    (collinearity)

    Within-groups reduced        n - q - 1     $\beta^2|W|/w_{\xi\xi}$                                               $\dfrac{(1-\theta_1)(1-\theta_2)}{1-\lambda} \cdot \dfrac{\beta^2|T|}{t_{\xi\xi}}$

    Total reduced                n - 1         $\beta^2|T|/t_{\xi\xi}$                                               $\dfrac{\beta^2|T|}{t_{\xi\xi}}$











The right-hand column gives terms which are proportional to the quantities used in testing 
direction and collinearity against a residual term. This analysis further justifies the pro- 
cedure given in the earlier papers, of testing the partial factors (i.e. terms corresponding 
to the terms given in the above covariance analysis) rather than the simple factors. 
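
The additivity of the covariance analysis for $p = 2$ can be verified directly from the latent-root expressions. The sketch below assumes those expressions, as fractions of the reduced total, to be $(\theta_1-\lambda)(\lambda-\theta_2)/\lambda(1-\lambda)$ for direction, $\theta_1\theta_2/\lambda$ for collinearity and $(1-\theta_1)(1-\theta_2)/(1-\lambda)$ for the reduced residual; the function name `p2_partition` and the numerical values are hypothetical.

```python
def p2_partition(theta1, theta2, lam):
    """Fractions of the reduced total sum of squares (p = 2) assigned to
    direction, collinearity and the reduced residual."""
    if not (theta2 <= lam <= theta1 and 0.0 < lam < 1.0):
        raise ValueError("requires theta2 <= lam <= theta1 and 0 < lam < 1")
    direction = (theta1 - lam) * (lam - theta2) / (lam * (1.0 - lam))
    collinearity = theta1 * theta2 / lam
    within = (1.0 - theta1) * (1.0 - theta2) / (1.0 - lam)
    return direction, collinearity, within

# Illustrative values only.
direction, collinearity, within = p2_partition(0.7, 0.2, 0.5)
total = direction + collinearity + within
```

The three fractions should sum to unity for any admissible $\lambda$ between $\theta_2$ and $\theta_1$, and the direction term vanishes when $\lambda$ coincides with either root.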

This approach to the tests for the adequacy of a hypothetical discriminant function
greatly simplifies the computations. By this method it is not necessary to calculate the $\theta_i$,
nor even to determine the canonical variates $x_i$. A straightforward covariance analysis is
all that is required. However, if what is required is not a test of the adequacy of a given
discriminant function but the calculation of the most satisfactory discriminant function
from the data, then the canonical analysis is still necessary.


(b) The case p> 2 
The procedure outlined above may be generalized for greater values of $p$. We take the
hypothetical discriminant function to be $\xi$, a given linear function of the $X_i$,
and consider the analysis of covariance of the $X_i$ on $\xi$. Since $\xi$ is a linear function of the $X_i$,
the number of linearly independent variates after the elimination of $\xi$ is reduced to $p-1$.
As for the case $p = 2$, the sums of squares and products of the residual variates may be













separated into terms for direction and collinearity, the term for direction having a single 
degree of freedom. From the p— 1 residual variates, a linear function may be made up in 
the same way as a discriminant function for a comparison for which the sum of squares for 
direction is relatively maximized. This sum of squares clearly has p — 1 degrees of freedom. 
When the sum of squares is maximized relative to the reduced total sum of squares, we get 
the sum of squares for the simple direction effect (collinearity being ignored) (11). When it
is maximized relative to the reduced sum of squares for residuals from regression on the
$Y_j$, we get the sum of squares for the partial direction effect (12). The analysis will be
demonstrated for the simple criterion for direction; the partial criterion may be similarly derived.

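For any two of the variates $X_i$ and $X_j$, the sum of products for direction may be derived
by difference in the manner given above (when $p = 2$), but may be expressed more simply
for the present purpose as

$$d_i d_j\, b_{\xi\xi} w_{\xi\xi}/t_{\xi\xi},$$

where

$$d_i = \frac{b_{i\xi}}{b_{\xi\xi}} - \frac{w_{i\xi}}{w_{\xi\xi}}.$$

Now consider some linear compound of the $X_i$, say

$$\zeta = l_1 X_1 + l_2 X_2 + \ldots + l_p X_p.$$

For this compound, the sum of squares for direction is

$$\Big(\sum_{i=1}^{p} l_i d_i\Big)^2 b_{\xi\xi} w_{\xi\xi}/t_{\xi\xi},$$

while the total sum of squares is

$$\sum_{i=1}^{p} \sum_{j=1}^{p} l_i l_j t_{ij}.$$

To maximize the ratio of direction to total, which we denote by $D$, with respect to the
$l_i$, we have

$$D \sum_{j=1}^{p} l_j t_{ij} = d_i \Big(\sum_{k=1}^{p} l_k d_k\Big) b_{\xi\xi} w_{\xi\xi}/t_{\xi\xi} \quad (i = 1, 2, \ldots, p). \tag{14}$$

The scale of the $l_i$ being arbitrary, we may so choose it that

$$\Big(\sum_{i=1}^{p} l_i d_i\Big) b_{\xi\xi} w_{\xi\xi}/t_{\xi\xi} = D,$$

whence

$$\sum_{j=1}^{p} l_j t_{ij} = d_i. \tag{15}$$

If we write the typical element of the inverse of $T$ as $t^{ij}$, then the solutions of (15) are given
by

$$l_i = \sum_{j=1}^{p} t^{ij} d_j.$$

Hence, from (14),

$$D = \sum_{i=1}^{p} \sum_{j=1}^{p} t^{ij} d_i d_j\, b_{\xi\xi} w_{\xi\xi}/t_{\xi\xi}.$$

It is to be noted that $\zeta$ is orthogonal to $\xi$. This may be obvious from its derivation;
alternatively, the total sum of products of $\zeta$ and $\xi$ is

$$\sum_{i=1}^{p} l_i t_{i\xi} = \sum_{i=1}^{p} \sum_{j=1}^{p} t^{ij} t_{i\xi} d_j = 0,$$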












while the sum of products for direction is proportional to

$$\Big(\sum_{i=1}^{p} l_i d_i\Big) d_\xi = 0,$$

since $d_\xi = b_{\xi\xi}/b_{\xi\xi} - w_{\xi\xi}/w_{\xi\xi} = 0$.


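The quantity $D$, being an invariant, may readily be expressed in terms of latent roots
in order to show the equivalence of the present results with those given in §IV.
Transforming to the canonical variates $x_i$, and putting as before

$$\xi = \sum_i m_i x_i,$$

we have

$$d_i = \frac{m_i(\theta_i - \lambda)}{\lambda(1-\lambda)},$$

$$D = \frac{\sum_i m_i^2(\theta_i - \lambda)^2}{\lambda(1-\lambda)} = \frac{\sum_i m_i^2\theta_i^2 - \lambda^2}{\lambda(1-\lambda)},$$

so that

$$1 - D = \frac{\lambda - \sum_i m_i^2\theta_i^2}{\lambda(1-\lambda)},$$

which agrees with (11).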

To derive the test function for partial collinearity in terms of determinantal ratios, we 
make use of the factorization (11). 

The residual likelihood criterion is
$$\prod(1-\theta_i)/(1-A) = \frac{|W|\,t_{\xi\xi}}{|T|\,w_{\xi\xi}}.$$
Hence the partial collinearity factor is
$$\frac{|W|\,t_{\xi\xi}}{|T|\,w_{\xi\xi}}\Big/(1-D) = \frac{|W|\,t_{\xi\xi}^2}{|T|\,w_{\xi\xi}}\Big/\Big(t_{\xi\xi} - b_{\xi\xi} w_{\xi\xi}\sum\sum t^{hi} d_h d_i\Big), \qquad (16)$$
which is a $(n-2;\ p-1,\ q-1)$ variate.

In the same way, the partial direction factor may be found by maximizing the ratio of
the sum of squares for direction to the reduced residual sum of squares. The maximized
ratio is
$$\frac{t_{\xi\xi}}{t_{\xi\xi} + b_{\xi\xi} w_{\xi\xi}\sum\sum w^{hi} d_h d_i}. \qquad (17)$$
This is a $(n-q;\ p-1,\ 1)$ variate, so may be tested by the $F$-test:
$$F = \frac{n-p-q+1}{p-1}\,\frac{b_{\xi\xi} w_{\xi\xi}}{t_{\xi\xi}}\sum\sum w^{hi} d_h d_i,$$
with $p-1$ and $n-p-q+1$ degrees of freedom.
It may be verified that these results agree with the general results given by Bartlett 
(1951) and the particular results given by Williams (1952a) for p = 2, q = 2. 


VI. GENERAL REMARKS ON THE LINEAR FUNCTIONAL RELATIONSHIP 


As has often been remarked, the linear functional relationship between two variates subject 
to error is different from either of the regression relationships between the two; the relationships
are identical only when the independent variate in the regression relationship is
errorless. Also the two relationships have different applications. The regression relationship
is the more generally useful, relating as it does to observed values; an important application 
is the prediction of either observed or ‘true’ values of one variate from observed values of 
the other. The functional relationship connects the ‘true’ values of the two variates; its 
use is in theoretical studies of underlying ‘laws’. In most practical applications the regres- 
sion and not the functional relationship is required. Since we are here considering only 
linear relationships, the word ‘linear’ will usually be omitted. 

The regression relationships are based on the variation in both the ‘true’ values and the 
random errors to which they are subject, the functional relationship on the variation in the 
‘true’ values alone. A little reflexion will show (and examination of the literature will con- 
firm; see, for example, Haavelmo (1943)) that the functional relationship is therefore 
relevant only to a study of how the ‘true’ values of both variates are affected by some 
extraneous variate or variates; that is to say, the relationship shows what elements of the 
system are invariant under changes in conditions. It is not of interest to know the under- 
lying relationship (if any) between two variates when each is affected only by random error; 
usually what is then wanted is one or other of the regression relationships. 

It is well known that, when both the ‘true’ values and the errors are normally distributed, 
the functional relationship cannot be determined from data. Lindley (1947), Reiersol (1950)
and others have proved a number of theorems which show that the non-normality of one 
of the distributions is necessary for the estimation of the relationship to be possible. Thus 
it may be taken that functional relationships are not determinable from the internal 
analysis of a set of variates with normal distributions. Put in another way, since the first- 
and second-order sample cumulants summarize all the information in samples from normal 
populations, it follows that information beyond the first- and second-order cumulants of 
the distributions must be available if the functional relationship is to be determinable. This 
information may be present in the sample, if the distributions are not normal, for then the 
first- and second-order sample cumulants are no longer sufficient statistics; alternatively, 
it may be introduced through knowledge of the relationship of each variate with some 
extraneous variates. As mentioned above, it is only under such circumstances, when some 
such additional information exists, that the functional relationship is of interest; in other 
words, whenever functional relationships are of practical interest they are also determinable 
from data. 

In passing it may be mentioned that Berkson’s (1950) method of ‘controlled’ variables, 
as elucidated by Lindley (1953), is really a means of dealing with one variate as though it 
were errorless, and bringing the estimation of the functional relationship back to the 
estimation of a regression equation. 

The foregoing discussion could be generalized for the determination of relationships 
among three or more variates, for which the same general considerations apply. We shall 
be concerned henceforth with the determination of functional relationships among a set 
of variates through their relationships with the variates of a second set. Reiersol (1941, 
1945) has termed the second set instrumental variates, to distinguish them from the first set 
of investigational variates. This approach has been used also by Geary (1943, 1949) and 
others.
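
As a rough illustration of the instrumental-variate idea for two investigational variates, the sketch below estimates the slope of the relationship as the ratio of the sum of products of the instrumental variate with $Y$ to its sum of products with $X$. The function name and the data are invented for illustration, and the estimator is the simple textbook form rather than a procedure quoted from the papers cited.

```python
def instrumental_slope(x, y, z):
    """Slope estimate sum(z*y) / sum(z*x), all variates taken about their means."""
    n = len(x)
    mx, my, mz = sum(x) / n, sum(y) / n, sum(z) / n
    szy = sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
    szx = sum((zi - mz) * (xi - mx) for zi, xi in zip(z, x))
    return szy / szx

# Hypothetical data: the 'true' values lie on Y = 1 + 2X and are driven by z;
# the errors added to x and y are chosen to be uncorrelated with z.
z = [-2.0, -1.0, 0.0, 1.0, 2.0]
x_true = [3.0 + zi for zi in z]
x = [xt + e for xt, e in zip(x_true, [0.2, -0.4, 0.4, -0.4, 0.2])]
y = [1.0 + 2.0 * xt + e for xt, e in zip(x_true, [-0.3, 0.6, -0.6, 0.6, -0.3])]

print(instrumental_slope(x, y, z))
```

Because the errors in this made-up sample have zero sum of products with `z`, the estimate recovers the slope of the relationship between the 'true' values.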

Another method (Wald, 1940; Bartlett, 1949) for determining the functional relationship 
uses groupings of the values of the variates. Neyman & Scott (1951) have shown that only 
‘in very exceptional circumstances’ does the method provide consistent estimates. Roughly
speaking, the method leads to consistent results provided the separation of values of one
of the variates is sufficiently wide in the neighbourhood of the group limits, so that the 
grouping based on observed values is equivalent to a grouping based on ‘true’ values. 
Clearly, under these conditions, the differences between groups may be attributed to some 
extraneous variates, so that the method of grouping is then roughly equivalent to the 
method of instrumental variates. 
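
The method of grouping may be sketched as follows, under the assumption that the observations are split into a lower and an upper half by the observed values of one variate and that the slope is taken as the ratio of the differences of the group means. The function name and the data are hypothetical; the groups are deliberately well separated, which is the condition under which the text says the method leads to consistent results.

```python
def grouping_slope(x, y):
    """Slope from the difference of group means, grouping by the observed x."""
    pairs = sorted(zip(x, y))
    half = len(pairs) // 2          # with an odd number, the middle point is dropped
    low, high = pairs[:half], pairs[-half:]

    def mean(values):
        return sum(values) / len(values)

    dx = mean([p[0] for p in high]) - mean([p[0] for p in low])
    dy = mean([p[1] for p in high]) - mean([p[1] for p in low])
    return dy / dx

# Hypothetical data: 'true' values at X = 1 and X = 5 on the line Y = 1 + 2X,
# with small errors that average to zero within each group.
x = [1.1, 0.9, 1.0, 5.0, 5.1, 4.9]
y = [3.2, 2.8, 3.0, 11.0, 10.7, 11.3]

print(grouping_slope(x, y))
```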

Both methods have this in common, that they introduce extraneous variation affecting 
all the variables between which the relationship is found. The existence or assumption of 
this variation is basic to the methods. 

In the next sections of this paper we give some exact tests for the existence of and for 
the constants in linear functional relationships, based on the use of instrumental variates. 
We shall assume for convenience that the investigational variates are subject to normally 
distributed errors, but make no particular assumptions about the distribution of the 
instrumental variates. Indeed, the instrumental variates may even be comparisons among 
groups or populations, in accordance with the general specification outlined in the Intro- 
duction. 


VII. TESTS FOR A SINGLE FUNCTIONAL RELATIONSHIP
(a) Preliminary remarks 


In § VI it was pointed out that consistent estimates of a functional relationship between two 
variates could be determined provided there was one instrumental variate. In general, a 
functional relationship among $p$ variates can be determined if there are $p-1$ or more instrumental
variates. In this section we shall consider, not the determination of the functional
relationship, but tests for the concordance with the data of a given relationship.
This approach parallels that adopted above in the sections on discriminant functions. 
It is possible to test the adequacy of the given relationship even when the data do not 
provide enough information to enable consistent estimates to be determined. 

We shall consider, following the set-up given in the earlier sections, that there are $p$
investigational variables $X_i$ and $q$ instrumental variables $Y_j$. The given functional relationship
among the $X_i$ will be defined by a null function $\xi$. It will be realized that, when $p$
exceeds $q$, it is possible to choose $p-q$ independent linear functions of the original $X_i$
whose correlation with the $Y_j$ (or sum of squares between groups) vanishes. These functions
are equivalent to the $p-q$ canonical variates corresponding to the identically vanishing
latent roots. Thus it is not possible to determine from the sample a unique null function.

On the other hand, the test for a given null function is always possible. The general test
for the adequacy of the function can be considered in two parts: first, a test for the conformity
of the function in direction, with $p-1$ degrees of freedom; and secondly, a test for
the residual correlation (or reduced variation between groups), with $q-p+1$ degrees of
freedom. When $q < p$, no test for residual correlation is possible; the test for direction then
has but $q$ degrees of freedom.
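
The partition of degrees of freedom just described can be written out as a small helper. This is only a bookkeeping sketch; the function name is hypothetical.

```python
def null_function_test_df(p, q):
    """Degrees of freedom for the two parts of the test of a given null function.

    Returns (direction, residual_correlation); the second entry is None
    when q < p, since no test for residual correlation is then possible.
    """
    if q >= p:
        return p - 1, q - p + 1
    return q, None

print(null_function_test_df(3, 5))
print(null_function_test_df(4, 2))
```

When $q \ge p$ the two parts together account for the $q$ degrees of freedom of the instrumental variates.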


(b) One instrumental variate ($q = 1$)


In order to gain some insight into the problems involved in testing the adequacy of a 
specified null function, we consider first the simplest case, where there is but one instru- 
mental variate, Y. In this case, the test is one of the direction of the chosen null function. 
It may be most simply viewed in the following way. The null variate $\xi$ has a sum of squares
for regression on $Y$ which bears to the total sum of squares a ratio $A$. It is required to test
whether this is large enough to indicate real association. The discriminant function $x$
based on the sample gives a ratio $\theta$, leaving a residual $1-\theta$, with $n-p$ degrees of freedom.
Hence the efficient test, in which the residual has been minimized with respect to all the
$X_i$, is given by the following analysis:


























                                                       Degrees of     Sum of
                                                       freedom        squares
Additional accounted for by discriminant function x    $p-1$          $\theta-A$
Direction of null variate                              1              $A$
Residual                                               $n-p$          $1-\theta$
Total                                                  $n$            1

so that
$$F_{1,\,n-p} = \frac{(n-p)A}{1-\theta}.$$

In terms of determinantal ratios, we have
$$A = b_{\xi\xi}/t_{\xi\xi}, \qquad 1-\theta = |W|/|T|,$$
so that
$$F_{1,\,n-p} = (n-p)\,\frac{b_{\xi\xi}}{t_{\xi\xi}}\,\frac{|T|}{|W|}.$$
This latter representation would give the simplest way of making the test. 
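
The two equivalent forms of the statistic for a single instrumental variate can be sketched as below. The function names and all numerical values are invented for illustration; only the two formulas are taken from the text.

```python
def f_single_instrument(A, theta, n, p):
    """F with 1 and n - p degrees of freedom for the direction of the null variate."""
    return (n - p) * A / (1.0 - theta)

def f_from_determinants(b_xixi, t_xixi, det_T, det_W, n, p):
    """The same statistic from b_xixi / t_xixi and the ratio |W| / |T|."""
    return (n - p) * (b_xixi / t_xixi) * (det_T / det_W)

# Hypothetical values: A = 0.12, theta = 0.40, n = 30, p = 3,
# with b_xixi / t_xixi = 0.12 and |W| / |T| = 0.60.
print(f_single_instrument(0.12, 0.40, 30, 3))
print(f_from_determinants(1.2, 10.0, 50.0, 30.0, 30, 3))
```

The resulting value would be referred to the $F$ distribution with 1 and $n-p$ degrees of freedom.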


(c) More than one instrumental variate ($q \ge 2$)

The method given above for $q = 1$ cannot be directly adapted to the more general case.
The approach that is adopted is as follows. By hypothesis, the null variate $\xi$ is uncorrelated
with the instrumental variates, so that its total variation is homogeneous. In any sample,
it is possible to choose a set of $p-1$ linearly independent functions of the $X_i$ which are
uncorrelated with $\xi$; these may be taken to define the direction of $\xi$ in the $p$-space. We shall
denote these $p-1$ functions by $X_i'$ ($i = 1, 2, \ldots, p-1$).

Now consider the analysis of any one of the original $X_i$, or of any linear function of them
not linearly dependent on the $X_i'$. We can carry out an analysis of covariance of this variate
with the $p-1$ variates $X_i'$, to determine the reduced sums of squares for regression on the
$Y_j$, for residual variation and for total variation (or between groups, within groups and total
in the analysis of variance model). Then, provided $q \ge p$, we have the following effects
appearing:











Effect                             Degrees of freedom
Difference of regressions          $p-1$
Regression on the $Y_j$, reduced   $q-p+1$
Residual, reduced                  $n-p-q+1$
Total, reduced                     $n-p+1$
















This analysis tests the existence of any correlation of the $X_i$ with the $Y_j$, apart from that
accounted for by the $X_i'$, and hence tests the correlation of the proposed null variate. The
second term tests the residual correlation of $\xi$ with the $Y_j$ and the first term the direction;
each may be tested against the reduced residual variance.

The analysis and significance tests are equivalent, whichever $X$-variate (subject to the
above-mentioned conditions) be chosen, since the elimination of the $X_i'$ by covariance
effectively reduces the number of residual variates to one. In practice it is often simplest
to take the $X$-variate as $\xi$ itself. The fact that the adjustment of $\xi$ by covariance leaves the
total sum of squares unaltered does not affect the results; the degrees of freedom for the
total line are still reduced to $n-p+1$.

It may be verified that, when $p = 2$, the analysis is exactly the same as that giving the
test of significance for the discriminant function $X_1'$. This is as it should be, since the hypothesis
that $X_1'$ is the population discriminant function is equivalent to the hypothesis that
the variate orthogonal to it (i.e. $\xi$) is a null variate.

The analysis has its parallel in the tests developed for this purpose from Fisher’s (1938) 
results by Tintner (1945, 1946), and based on the values of the latent roots. For the existence 
of a single null function (or linear functional relationship), Tintner takes as his test function 
the smallest of p latent roots. As has been shown by Hsu (1941), this smallest latent root is 
distributed approximately as a sum of squares with q—p+1 degrees of freedom. This 
approximation becomes exact in the limit when all the other latent roots are near to unity 
(apparently even if only the sample latent roots approach unity, regardless of the values of 
the population roots). This result has been established above (10) as a particular result on 
the factorization of determinantal ratios, except that there the term for ‘difference of 
regressions’ has not been segregated from the residual. In the derivation of the tests to be 
given below, in terms of canonical correlations, the relationship of the present tests with the 
approximate tests will be made clear. 


(i) Analysis in terms of canonical correlations 

The results of the analysis of covariance may be expressed in terms of the squared canonical
correlation coefficients or discriminant ratios $\theta_1, \theta_2, \ldots, \theta_p$, between the sets of canonical
variates $x_i$ and $y_i$. Let the hypothetical null-variate be
$$\xi = \sum m_i x_i, \qquad \sum m_i^2 = 1.$$
Then the set of $p-1$ $x$-variates orthogonal to $\xi$ may be taken as
$$x_h' = x_h - m_h\xi.$$

Now of the $y_i$, $p$ have non-zero correlation with the corresponding $x_i$, and $q-p$ have zero
correlation with any $x_i$. They may be transformed to another set of variates in the following
way: (1) the set of $p-1$ variates most closely correlated with the $x_i'$; (2) a variate $\eta$, uncorrelated
with the $x_i'$ but correlated with the $x_i$, and (3), as before, the $q-p$ variates uncorrelated
with the $x_i$.

It is clear that the variate $\eta$ represents the reduced correlation of $\xi$ with the $y_i$. It is readily
defined by the conditions given; for let
$$\eta = \sum n_i y_i;$$
then
$$\sum \eta x_h' = 0 \qquad (h = 1, 2, \ldots, p-1).$$





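Hence
$$\sum \eta x_h = m_h \sum \xi\eta,$$
or
$$n_h\theta_h^{1/2} = m_h\sum m_i n_i\theta_i^{1/2},$$
so that
$$n_h \propto m_h\theta_h^{-1/2},$$
and, since $\sum n_i^2 = 1$,
$$n_h = m_h\theta_h^{-1/2}\Big/\Big(\sum m_i^2/\theta_i\Big)^{1/2}.$$

For the regression of $\xi$ on $\eta$, the sum of squares, with $q-p+1$ degrees of freedom, is
$$\Big(\sum \xi\eta\Big)^2 = n_h^2\theta_h/m_h^2 = \Big(\sum m_i^2/\theta_i\Big)^{-1}. \qquad (18)$$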


This is the sum of squares for regression, reduced by elimination of the $x_i'$. Clearly it
vanishes if $q < p$ (for then $\theta_p = 0$), and in general takes values between $\theta_p$ and $\theta_1$.

In the same way it may be shown that the reduced residual sum of squares (i.e. the sum
of squares of residuals of $\xi$ from regression on the $y_i$ and the $x_i'$), with $n-p-q+1$ degrees
of freedom, is
$$\Big(\sum m_i^2/(1-\theta_i)\Big)^{-1}. \qquad (19)$$

Finally, by subtraction, the sum of squares for difference of regressions (i.e. for the test
of direction of the proposed null-variate), with $p-1$ degrees of freedom, is
$$1 - \Big(\sum m_i^2/\theta_i\Big)^{-1} - \Big(\sum m_i^2/(1-\theta_i)\Big)^{-1}. \qquad (20)$$

These sums of squares may also be expressed in terms of the latent roots $\theta_i$ and the latent
roots for the variates $x_i'$ and $y_i$, which will be denoted by $\theta_i'$ ($i = 1, 2, \ldots, p-1$). Since $\xi$ is
orthogonal to all the $x_i'$, it follows that the product of the latent roots $\theta_i'$, multiplied by the
sum of squares for regression of $\xi$ on $\eta$, is equal to the product of the original latent roots;
that is,
$$\prod\theta_i'\,\Big(\sum m_i^2/\theta_i\Big)^{-1} = \prod\theta_i.$$
Similarly,
$$\prod(1-\theta_i')\,\Big(\sum m_i^2/(1-\theta_i)\Big)^{-1} = \prod(1-\theta_i).$$


The analysis of variance may then be set out as follows:

               Degrees of freedom   Sum of squares
Direction      $p-1$                $1 - \prod\theta_i/\prod\theta_i' - \prod(1-\theta_i)/\prod(1-\theta_i')$
Correlation    $q-p+1$              $\prod\theta_i/\prod\theta_i'$
Residual       $n-p-q+1$            $\prod(1-\theta_i)/\prod(1-\theta_i')$
Total          $n-p+1$              1











If the latent roots $\theta_i$ and the direction cosines $m_i$ are known, the simplest procedure is to
calculate the sums of squares from the expressions (18), (19) and (20).

The sum of squares for correlation always exceeds $\theta_p$, but approaches $\theta_p$ when the direction
of $\xi$ approaches that of $x_p$. Hence it is seen that the use of $\theta_p$, as a sum of squares with
$q-p+1$ degrees of freedom, according to the approximate tests, will underestimate the
significance of the correlation.
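
The calculation from (18), (19) and (20) may be sketched as follows. The function name, the latent roots and the direction cosines are invented for illustration, and the variance ratios simply take each term against the reduced residual mean square, as the text indicates; the sketch assumes $q \ge p$ with all $p$ roots non-zero.

```python
def null_variate_analysis(theta, m, n, q):
    """Sums of squares (18)-(20) for a null variate with direction cosines m.

    theta: the p squared canonical correlations (all non-zero, so q >= p);
    m: direction cosines with unit sum of squares; n: total degrees of freedom.
    """
    p = len(theta)
    correlation = 1.0 / sum(mi ** 2 / ti for mi, ti in zip(m, theta))        # (18)
    residual = 1.0 / sum(mi ** 2 / (1.0 - ti) for mi, ti in zip(m, theta))   # (19)
    direction = 1.0 - correlation - residual                                 # (20)
    ss = {"direction": direction, "correlation": correlation, "residual": residual}
    df = {"direction": p - 1, "correlation": q - p + 1, "residual": n - p - q + 1}
    residual_mean_square = ss["residual"] / df["residual"]
    # Variance ratios of the first two lines against the reduced residual.
    f = {k: (ss[k] / df[k]) / residual_mean_square for k in ("direction", "correlation")}
    return ss, df, f

# Hypothetical latent roots and direction cosines (0.6**2 + 0.8**2 = 1).
ss, df, f = null_variate_analysis([0.8, 0.5, 0.1], [0.0, 0.6, 0.8], n=40, q=4)
print(ss)
print(df)
print(f)
```

In this made-up case the sum of squares for correlation lies between the smallest root and the largest, and it falls to the smallest root when the direction cosines are `[0.0, 0.0, 1.0]`.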


If $q < p$, the analysis reduces to

             Degrees of freedom   Sum of squares
Direction    $q$                  $1 - \prod(1-\theta_i)/\prod(1-\theta_i')$
Residual     $n-p-q+1$            $\prod(1-\theta_i)/\prod(1-\theta_i')$
Total        $n-p+1$              1

















there being in this case no test for correlation.
When $q = 1$ this analysis reduces to that previously given; for it may be shown that, in
general,
$$A = \sum\theta_i - \sum\theta_i',$$
so that, in this case,
$$(1-\theta_1)/(1-\theta_1') = 1 - A/(1-\theta_1+A),$$
and the sums of squares for direction and residual are respectively
$$A/(1-\theta_1+A)$$
and
$$(1-\theta_1)/(1-\theta_1+A),$$
which are proportional to those previously given.
It may readily be verified that, when $p = 2$, these sums of squares reduce to those given
for a test of a single discriminant function.


(ii) Analysis in terms of determinantal ratios 


For computing purposes it is more straightforward, when the individual latent roots and
canonical variates are not required, to express the results just given in terms of determinantal
ratios.

We consider the analysis of the null variate $\xi$ ($= \sum l_i X_i$, no longer necessarily normalized
so that its sum of squares is unity), deriving the reduced sum of squares for regression, with
$q-p+1$ degrees of freedom, and the reduced residual sum of squares, with $n-p-q+1$
degrees of freedom. The sum of squares for ‘difference of regressions’, with $p-1$ degrees
of freedom, is obtained by subtraction of these two items from the (reduced) total.

We may take the $X_i'$ to be
$$X_i' = X_i - t_{i\xi}\,\xi/t_{\xi\xi} \qquad (i = 1, 2, \ldots, p-1),$$
and shall write
$P_i$ for the vector $(p_{i1}, p_{i2}, \ldots, p_{iq})$, and similarly $P_\xi$ for $(p_{\xi 1}, p_{\xi 2}, \ldots, p_{\xi q})$,
$T_\xi$ for the vector $(t_{\xi 1}, t_{\xi 2}, \ldots, t_{\xi p})$,
$T_{\xi-}$ for the vector $(t_{\xi 1}, t_{\xi 2}, \ldots, t_{\xi,\,p-1})$,
$P_-$ for the matrix consisting of the first $p-1$ rows of $P$,
and $T_-$ for the matrix consisting of the first $p-1$ rows and columns of $T$.

To calculate the reduced sum of squares for regression, the sums of squares and products
of the $X_i'$ and $\xi$ for the regression of each on the $Y_j$ are first determined. For example, the
sum of products of $X_i'$ with $Y_j$ is
$$p_{ij} - p_{\xi j}\,t_{i\xi}/t_{\xi\xi},$$





so that the regression sum of products of $X_h'$ and $X_i'$ is
$$(P_h - P_\xi\,t_{h\xi}/t_{\xi\xi})\,U^{-1}\,(P_i - P_\xi\,t_{i\xi}/t_{\xi\xi})'.$$
Accordingly, the matrix of regression sums of squares and products of the $X_i'$ is
$$(P_- - T_{\xi-}'P_\xi/t_{\xi\xi})\,U^{-1}\,(P_-' - P_\xi'T_{\xi-}/t_{\xi\xi}). \qquad (21)$$


Now the reduced sum of squares for the regression of $\xi$ on the $Y_j$ is the ratio of two determinants:
that of the regression sums of squares and products of the $X_i'$ and $\xi$, and that of
the regression sums of squares and products of the $X_i'$ alone. By transformation without
change of scale, the $X_i'$ and $\xi$ may be replaced by $X_1, X_2, \ldots, X_{p-1}$ and $l_p X_p$; hence the
determinant in the numerator is equal to
$$l_p^2\,|PU^{-1}P'| = l_p^2\,|B|.$$
The required denominator is the determinant of the matrix (21), which is found, by elementary
transformations, to be
$$-\frac{l_p^2}{t_{\xi\xi}^2}\begin{vmatrix} 0 & T_\xi \\ T_\xi' & PU^{-1}P' \end{vmatrix} = -\frac{l_p^2}{t_{\xi\xi}^2}\begin{vmatrix} 0 & T_\xi \\ T_\xi' & B \end{vmatrix} = \frac{l_p^2}{t_{\xi\xi}^2}\,|B|\,T_\xi B^{-1}T_\xi'.$$
Hence the reduced sum of squares for regression is
$$t_{\xi\xi}^2/T_\xi B^{-1}T_\xi'.$$
In the same way, it may be shown that the reduced residual sum of squares is
$$t_{\xi\xi}^2/T_\xi W^{-1}T_\xi'.$$
Also, since
$$T_\xi T^{-1}T_\xi' = t_{\xi\xi},$$
we see that the reduced total sum of squares, which is given by the analogous formula
$$t_{\xi\xi}^2/T_\xi T^{-1}T_\xi',$$
is equal to $t_{\xi\xi}$, as it should be.


On elimination of common factors from each term, the analysis of variance may be set 
out as follows: 











                      Degrees of freedom   Sum of squares
Direction             $p-1$                (by subtraction)
Regression, reduced   $q-p+1$              $(T_\xi B^{-1}T_\xi')^{-1}$
Residual, reduced     $n-p-q+1$            $(T_\xi W^{-1}T_\xi')^{-1}$
Total, reduced        $n-p+1$              $(T_\xi T^{-1}T_\xi')^{-1} = t_{\xi\xi}^{-1}$

















It may readily be verified that, when the variates are transformed to the canonical set, 
the terms in this analysis of variance reduce to those given in the previous subsection. 
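
The determinantal form of the analysis can be sketched in plain Python as below. The helper names are hypothetical, the matrices are invented, and it is assumed here that $T = B + W$ with $B$ and $W$ non-singular. The sums of squares are computed before the elimination of the common factor $t_{\xi\xi}^2$, so that the total comes out as $t_{\xi\xi}$.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def quad_inv(v, M):
    """The quadratic form v M^{-1} v' for a vector v and non-singular matrix M."""
    return sum(vi * xi for vi, xi in zip(v, solve(M, v)))

def reduced_analysis(l, B, W):
    """Reduced sums of squares for the null variate xi = sum(l_i X_i)."""
    p = len(l)
    T = [[B[i][j] + W[i][j] for j in range(p)] for i in range(p)]
    T_xi = [sum(T[i][j] * l[j] for j in range(p)) for i in range(p)]  # (t_xi1, ..., t_xip)
    t_xixi = sum(l[i] * T_xi[i] for i in range(p))
    regression = t_xixi ** 2 / quad_inv(T_xi, B)
    residual = t_xixi ** 2 / quad_inv(T_xi, W)
    total = t_xixi ** 2 / quad_inv(T_xi, T)
    return {"direction": total - regression - residual,
            "regression": regression, "residual": residual, "total": total}

# Hypothetical canonical case: T is the unit matrix, B = diag(theta), W = diag(1 - theta).
B = [[0.8, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.1]]
W = [[0.2, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.9]]
print(reduced_analysis([0.0, 0.6, 0.8], B, W))
```

With the variates already in canonical form, as in this example, the regression term reduces to $(\sum m_i^2/\theta_i)^{-1}$.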










VIII. GENERALIZATIONS 


It has been shown above that the test for a given functional relationship as developed in 
this paper is really equivalent to the test for the adequacy of p— 1 discriminant functions 
to represent the variation in the data. Thus we have treated here only the tests for 1 or p—1 
discriminant functions. Tests for any number r of discriminant functions may be derived 
in the same way, but the cases treated above are the simplest to deal with, since they each 
deal with only one function (discriminant function or null function); fiducial limits for the 
constants in these functions can therefore be derived using these significance tests. 

If r hypothetical discriminant functions are given, their effect may be eliminated, giving 
the following factorization of the likelihood criterion: 


$$(n;\ p, q) = (n;\ r, q)\,(n-r;\ p-r, q).$$

The second factor on the right is the residual likelihood criterion, which may be further
factorized to give criteria for direction and collinearity:

$$\begin{aligned}
(n-r;\ p-r, q) &= \underbrace{(n-r;\ p-r, r)}_{\text{simple direction}}\ \underbrace{(n-2r;\ p-r, q-r)}_{\text{partial collinearity}} \\
&= \underbrace{(n-q;\ p-r, r)}_{\text{partial direction}}\ \underbrace{(n-r;\ p-r, q-r)}_{\text{simple collinearity}}.
\end{aligned}$$

When $r = p-1$, one of the degrees of freedom in each factor is unity, which is the reason
why the tests for a single null function can be thrown into the form of an analysis of variance.
We then have
$$\begin{aligned}
(n-p+1;\ 1, q) &= (n-p+1;\ 1, p-1)\,(n-2p+2;\ 1, q-p+1) \\
&= (n-q;\ 1, p-1)\,(n-p+1;\ 1, q-p+1),
\end{aligned}$$


which corresponds to the analysis of variance set-up previously given. 
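
The parameters of the factors in these factorizations can be tabulated mechanically. The sketch below is only a bookkeeping aid with a hypothetical function name; each triple is written in the $(n;\ p, q)$ notation of the text.

```python
def factor_parameters(n, p, q, r):
    """Parameter triples of the factors obtained on eliminating r hypothetical
    discriminant functions from the criterion (n; p, q)."""
    return {
        "hypothetical functions": (n, r, q),
        "residual": (n - r, p - r, q),
        "simple direction": (n - r, p - r, r),
        "partial collinearity": (n - 2 * r, p - r, q - r),
        "partial direction": (n - q, p - r, r),
        "simple collinearity": (n - r, p - r, q - r),
    }

# Hypothetical case n = 30, p = 4, q = 6 with r = p - 1 = 3:
# the second entry of every residual factor is then unity.
for name, triple in factor_parameters(30, 4, 6, 3).items():
    print(name, triple)
```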

There seem to be practical applications of tests for more than one functional relationship; 
for example, Bartlett (1948) discusses the problem of estimating supply and demand 
relationships. On the other hand, it seems that seldom if ever is more than one discriminant 
function required in specifying differences among populations; if these differences cannot 
be interpreted in terms of one discriminant function, what is then required is some likeli- 
hood test as a means of classifying the observations into the different populations. 


REFERENCES 


ANDERSON, T. W. (1951). Estimating linear restrictions on regression coefficients for multivariate
normal distributions. Ann. Math. Statist. 22, 327.

BARTLETT, M. S. (1938). Further aspects of the theory of multiple regression. Proc. Camb. Phil. Soc.
34, 33.

BARTLETT, M. S. (1939). A note on tests of significance in multivariate analysis. Proc. Camb. Phil.
Soc. 35, 180.

BARTLETT, M. S. (1947). Multivariate analysis. J. R. Statist. Soc. Supp. 9, 176.

BARTLETT, M. S. (1948). A note on the statistical estimation of supply and demand relations from time
series. Econometrica, 16, 323.

BARTLETT, M. S. (1949). Fitting a straight line when both variables are subject to error. Biometrics,
5, 207.

BARTLETT, M. S. (1951). The goodness of fit of a single hypothetical discriminant function in the case
of several groups. Ann. Eugen., Lond., 16, 199.

BERKSON, J. (1950). Are there two regressions? J. Amer. Statist. Ass. 45, 164.

FISHER, R. A. (1938). The statistical utilization of multiple measurements. Ann. Eugen., Lond., 8, 376.

GEARY, R. C. (1943). Relations between statistics: the general and the sampling problem when the
samples are large. Proc. R. Irish Acad. A, 49, 177.

GEARY, R. C. (1948). Studies in relations between economic time series. J. R. Statist. Soc. B, 10, 140.

GEARY, R. C. (1949). Determination of linear relations between systematic parts of variables with
errors of observation the variances of which are unknown. Econometrica, 17, 30.

HAAVELMO, T. (1943). The statistical implications of a system of simultaneous equations. Econometrica,
11, 1.

HSU, P. L. (1941). On the problem of rank and the limiting distribution of Fisher’s test function.
Ann. Eugen., Lond., 11, 39.

KOOPMANS, T. (1937). Linear Regression Analysis of Economic Time Series. Haarlem: De Erven F.
Bohn N.V.

LINDLEY, D. V. (1947). Regression lines and the linear functional relationship. J. R. Statist. Soc.
Supp. 9, 218.

LINDLEY, D. V. (1953). Estimation of a functional relationship. Biometrika, 40, 47.

NEYMAN, J. & SCOTT, ELIZABETH L. (1951). On certain methods of estimating the linear structural
relation. Ann. Math. Statist. 22, 352.

REIERSOL, O. (1941). Confluence analysis by means of lag moments and other methods of confluence
analysis. Econometrica, 9, 1.

REIERSOL, O. (1945). Confluence analysis by means of instrumental sets of variables. Ark. Mat.
Astr. Fys. A, 32, no. 4.

REIERSOL, O. (1950). Identifiability of a linear relation between variables which are subject to error.
Econometrica, 18, 375.

TINTNER, G. (1945). A note on rank, multicollinearity and multiple regression. Ann. Math. Statist.
16, 304.

TINTNER, G. (1946). Multiple regression for systems of equations. Econometrica, 14, 5.

TINTNER, G. (1950). A test for linear relations between weighted regression coefficients. J. R. Statist.
Soc. B, 12, 273.

WALD, A. (1940). The fitting of straight lines if both variables are subject to error. Ann. Math.
Statist. 11, 284.

WILLIAMS, E. J. (1952a). Some exact tests in multivariate analysis. Biometrika, 39, 17.

WILLIAMS, E. J. (1952b). The interpretation of interactions in factorial experiments. Biometrika,
39, 65.

WILLIAMS, E. J. (1952c). Use of scores for the analysis of association in contingency tables. Biometrika,
39, 274.

WILLIAMS, E. J. (1953). Tests of significance for concurrent regression lines. Biometrika, 40, 297.













THE USE OF TRANSFORMATIONS AND MAXIMUM LIKELIHOOD 
IN THE ANALYSIS OF QUANTAL EXPERIMENTS 
INVOLVING TWO TREATMENTS 


By F. YATES, F.R.S. 
Rothamsted Experimental Station 


If the effect of a treatment is of the quantal (all or nothing) type, e.g. it kills or does not kill 
an insect, the results of experiments will be in the form of proportions of experimental units 
affected. If $t$ such treatments are compared in a single experiment in which the experimental
units are randomly allocated, the differences between the observed proportions will
give estimates of the differences between the treatments, and a $\chi^2$ test on the associated
$2 \times t$ table ($t-1$ d.f.) provides an overall test of significance. If, however, in order to
increase the precision, the units are subdivided before allocation into groups which are 
relatively homogeneous within themselves, the experiment will be analogous to an ordinary 
randomized block experiment, the groups being the blocks, but the quantities requiring 
analysis (corresponding to the ‘plot yields’) will be proportions instead of quantitative 
measurements. When a set of experiments is carried out, with random allocation within 
each experiment, but with possible lack of homogeneity between experiments, the situation 
is similar, the groups being in this case experiments. 

If the numbers of experimental units in the various cells of the group × treatment or
experiment × treatment table are approximately equal, or more generally if they are proportionate,
the analysis presents no great difficulty. The use of the angular transformation,
and an analysis of variance of the transformed values, will frequently be all that is required. 
Indeed, if the variation in susceptibility from experiment to experiment is not large, the 
results will not be seriously distorted if the data from the different groups or experiments 
are pooled and treated as homogeneous. If, however, the numbers in the different cells 
vary in an irregular manner, as is often the case in material of this kind, pooling is inadmis- 
sible and inequality in the weights prevents straightforward analysis. 
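
For reference, the angular transformation mentioned above may be sketched as follows. The function names are hypothetical, and the variance shown is the usual large-sample approximation of $1/(4n)$ in squared radians (about $821/n$ in squared degrees); that approximation is an assumption added here, not a figure given in the text.

```python
import math

def angular(r, n):
    """Angular transformation of the proportion r/n, in degrees."""
    return math.degrees(math.asin(math.sqrt(r / n)))

def angular_variance(n):
    """Approximate large-sample variance of the transformed value, in squared degrees."""
    return (180.0 / math.pi) ** 2 / (4.0 * n)

# Hypothetical cell: 25 units affected out of 100.
print(angular(25, 100), angular_variance(100))
```

Because the approximate variance depends only on $n$, cells with equal numbers carry equal weight, which is why an unweighted analysis of variance of the transformed values suits the equal-numbers case.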

The application of methods of estimation to quantal data has been extensively developed 
in recent years through the use of transformations in conjunction with maximum likelihood. 
Exact analytical methods were first developed for the probit transformation, in connexion 
with biological assay, but similar methods have recently been developed for other trans- 
formations of which the logit and log log are of most interest in this connexion. These methods 
are beginning to be used for the analysis of quantal data of many kinds (see, for example, 
Jolly, 1950; Dyke & Patterson, 1952; and Yates, 1953, § 9-7), but their utility for dealing 
with experimental data of the type we are considering is still insufficiently realized. Com- 
bination of probabilities, for example, is still commonly used to provide a test of significance 
(sometimes called the probability integral test) for sets of experiments involving two 
treatments. 

The estimation approach automatically provides a test of significance as a by-product
of the estimate and its estimate of error. Its greatest advantage, however, is that if an 
effect is demonstrated we are immediately provided with an estimate of its magnitude in
meaningful terms, together with approximate fiducial limits. At the same time, answers are
provided to certain ancillary questions regarding the homogeneity of the data. 

The analysis of a group of experiments containing several treatments has been discussed 
by Jolly using the probit transformation. The present paper deals in more detail with the 
case of two treatments, particularly in respect of errors and tests of significance. The appro- 
priateness of the various transformations under different circumstances is also considered 
in more detail than by Jolly. A further new point is the use of weighted estimates to give 
a first approximation. The investigation originated in a request by Prof. Gert Bonnier of 
Stockholm for advice on the exact procedure for applying the combination of probabilities 
test to a set of genetical experiments on mutation rates in Drosophila, and I am much in- 
debted to him for permission to reproduce the data here. The numbers involved in these 
experiments were sufficiently large for the data to be analysed by large-sample methods, 
and this analysis will first be described, as it brings out many of the points at issue without 
the complexities associated with the use of transformations and the method of maximum 
likelihood. The maximum-likelihood method will then be developed, and applied to this 
set of data and to three examples given by Pearson (1950). 


THE COMBINATION OF PROBABILITIES TEST 


If we have a set of k experiments (or observational data) comparing two treatments, the 
results of each experiment can be set out in the form of a 2 x 2 table. A test of significance 
on the data as a whole can be made by calculating the significance level P for each experi- 
ment separately, and then combining these probabilities by the method first suggested by 
Fisher in the fourth edition of Statistical Methods for Research Workers (1932), and indepen- 
dently by Karl Pearson (1933). This consists of calculating the sum of the quantities 
$-2\log_e P$, which are the values of $\chi^2$ for 2 d.f. corresponding to the significance levels $P$.
If there is no difference between the treatments, then, subject to certain qualifications,
$S(-2\log_e P)$ will be distributed as $\chi^2$ with $2k$ d.f.
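
The arithmetic of the combination of probabilities test can be sketched as below. The function name and the example probabilities are invented for illustration. The tail area uses the closed form available for $\chi^2$ with an even number of degrees of freedom, so no statistical library is needed.

```python
import math

def combine_probabilities(p_values):
    """Sum of -2 log_e P over k experiments, its degrees of freedom (2k),
    and the corresponding upper tail area of chi-squared."""
    k = len(p_values)
    s = sum(-2.0 * math.log(p) for p in p_values)
    half = s / 2.0
    # For 2k degrees of freedom the upper tail area of chi-squared is
    # exp(-s/2) * sum_{j=0}^{k-1} (s/2)^j / j!
    tail = math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k))
    return s, 2 * k, tail

# Hypothetical significance levels from three experiments.
print(combine_probabilities([0.20, 0.08, 0.11]))
```

With a single experiment the combined tail area is simply the original significance level, which gives a quick check on the calculation.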

This test and the analogous test provided by direct summation of the values of $\chi^2$ for the
separate experiments have been discussed by various authors, in particular Cochran (1942, 
1952), Lancaster (1949) and Pearson (1950). Certain further points emerged in the course 
of the present investigation, but these are best dealt with in a separate paper (Yates, 1955). 
Here it is only necessary to make the following general points: 

(1) Application of the combination of probabilities test does not provide an analysis 
which is in any sense complete, since no estimate of the magnitude of the treatment difference 
is provided. 

(2) The test must be in some degree inefficient. There are two main reasons for this: 

(a) No account is taken of the relative accuracy of the different experiments when 
combining the results. 

(b) $S(-2\log_e P)$ does not provide an efficient estimate of the treatment difference, even
when all experiments are of the same accuracy and the difference is small, and consequently 
cannot be the basis of an efficient test of significance. 

It is perhaps also worth noting here that Fisher did not envisage the use of the com- 
bination of probabilities test on data of the type we are considering. He put it forward for 
use in cases in which it is desired to obtain a single test of the significance of an aggregate of 
probabilities, ‘taking account only of these probabilities, and not of the detailed composition 
of the data from which they are derived, which may be of very different kinds’.











384 Use of transformations and maximum likelihood 


DROSOPHILA DATA: THE LARGE SAMPLE APPROACH 


The data with which we are concerned are shown in Table 1. They were derived from a set 
of six experiments on the frequency of lethals in paternal X-chromosomes from irradiated 
spermatozoa of Drosophila melanogaster. The experiments were carried out to test whether 
there was any difference in this frequency when the irradiation was given to spermatozoa 
which were harboured (F) in the females’ receptacles, and (M) in the males’ testicles. In the 
first four experiments a dose of 960r. was given, and in the last two a dose of 3000r. In 
addition to the change of irradiation rate it was believed that unavoidable environmental 
changes between experiments might influence the mutation rates. An account of the 
experiments has now been published (Bonnier & Lüning, 1953).


Table 1. Frequency of mutants with two methods of irradiation

                   Method F                        Method M
Exp.     Mutant   Normal   % mutant      Mutant   Normal   % mutant
 1          29      815      3.44           45     1486      2.94
 2          78     2622      2.89           27     1438      1.84
 3         110     3175      3.35          100     4281      2.28
 4          25     1038      2.35           52     2053      2.47
 5          31      339      8.38           43      507      7.82
 6          57      543      9.50           27      436      5.83





























It will be seen that in all experiments except Exp. 4 method F gives a higher mutation 
rate than method M. We require to test the significance of this difference, and obtain an 
estimate of its magnitude. 

In order to develop a suitable estimation procedure we must specify in mathematical
terms the quantities that require estimation. This specification will depend on the
phenomena that are being investigated. In the present case we may postulate that with a given
method of irradiation and an experiment of standard sensitivity a unit of irradiation has
a given small chance $\theta'$ of producing a mutation at a given locus. The chance that $\lambda$ units of
irradiation will produce a mutation at that locus will then be

\[ 1 - e^{-\lambda\theta'}. \]

If several loci are involved, with probabilities $\theta'$, $\theta''$, etc., and $\theta' + \theta'' + \dots = \theta$, the total
probability of one or more mutations will be

\[ 1 - e^{-\lambda(\theta' + \theta'' + \dots)} = 1 - e^{-\lambda\theta}. \]

If the conditions vary from experiment to experiment a sensitivity factor $\mu_r$ will have to be
introduced, giving a total probability of mutation for experiment $r$ of $1 - e^{-\lambda_r\mu_r\theta}$, where $\lambda_r$
is the irradiation rate for the $r$th experiment.

With two methods of irradiation the probabilities of mutation in experiment $r$ can
therefore be written as

\[ \pi_{1r} = 1 - \exp[-\lambda_r\mu_r\theta_1], \]
\[ \pi_{2r} = 1 - \exp[-\lambda_r\mu_r\theta_2]. \]











F. YATES 385 
Since $\mu_r$ is unknown it must be eliminated, giving

\[ \frac{\theta_1}{\theta_2} = \frac{\log(1-\pi_{1r})}{\log(1-\pi_{2r})}. \]

Substituting the observed mutation rates $p_{1r}$ and $p_{2r}$ for $\pi_{1r}$ and $\pi_{2r}$ we obtain an estimate
of $\theta_1/\theta_2$ from the $r$th experiment. These separate estimates can then be combined by
weighting inversely as their estimated variances.

In the present example the mutation rates are sufficiently small for $\pi_{1r}$ and $\pi_{2r}$ to be taken
as approximately equal to $\lambda_r\mu_r\theta_1$ and $\lambda_r\mu_r\theta_2$ respectively. There will be certain advantages
in working with estimates of

\[ \frac{\theta_1-\theta_2}{\tfrac12(\theta_1+\theta_2)} \]

instead of $\theta_1/\theta_2$, i.e. the ratio of the difference of the mutation rates for experiments of unit
sensitivity per unit irradiation to the mean of these mutation rates. The quantity

\[ E_r = \frac{p_{1r}-p_{2r}}{\tfrac12(p_{1r}+p_{2r})} \]

will provide an estimate of this ratio. An approximation to the variance of $E_r$ is easily
calculated by large-sample theory as follows:

\[ \frac{\partial E_r}{\partial p_{1r}} = \frac{4p_{2r}}{(p_{1r}+p_{2r})^2}, \qquad \frac{\partial E_r}{\partial p_{2r}} = -\frac{4p_{1r}}{(p_{1r}+p_{2r})^2}. \]








If the $r$th contingency table is written

                   F                       M                     Total
Mutant          n'_{1r}                 n'_{2r}                  n'_r
Normal      n_{1r} - n'_{1r}        n_{2r} - n'_{2r}          n_r - n'_r
Total           n_{1r}                  n_{2r}                   n_r

so that $p_{1r} = n'_{1r}/n_{1r}$, $p_{2r} = n'_{2r}/n_{2r}$, we have

\[ V(p_{1r}) = \frac{\pi_{1r}(1-\pi_{1r})}{n_{1r}}, \qquad V(p_{2r}) = \frac{\pi_{2r}(1-\pi_{2r})}{n_{2r}}, \qquad \operatorname{cov}(p_{1r}, p_{2r}) = 0. \]

Hence, approximately,

\[ V(E_r) = \frac{16\,p_{1r}p_{2r}}{(p_{1r}+p_{2r})^4}\left\{\frac{p_{2r}q_{1r}}{n_{1r}} + \frac{p_{1r}q_{2r}}{n_{2r}}\right\}. \tag{1} \]

This formula is approximate both because we are dealing with a non-linear function and
because $\pi_{1r}$ and $\pi_{2r}$ have been replaced by $p_{1r}$ and $p_{2r}$. If we are concerned with testing the
significance of the difference of $p_{1r}$ and $p_{2r}$ we may replace them both by the pooled estimate
$p_r = n'_r/n_r$ from the marginal totals of the table. Since the factor $1/(p_{1r}+p_{2r})$ occurs in $E_r$
the factor $1/(p_{1r}+p_{2r})^2$ may be left unchanged in the variance. We then have

\[ V(E_r) = \frac{4\,p_rq_r}{(p_{1r}+p_{2r})^2}\,\frac{n_r}{n_{1r}n_{2r}}. \tag{2} \]










It may be noted that under these circumstances

\[ \frac{E_r^2}{V(E_r)} = \frac{n_r\{n'_{1r}(n_{2r}-n'_{2r}) - n'_{2r}(n_{1r}-n'_{1r})\}^2}{n_{1r}\,n_{2r}\,n'_r\,(n_r-n'_r)}, \]

which is the ordinary expression for $\chi^2$ in a $2 \times 2$ contingency table, without correction for
continuity.

As has been shown by Cochran (1942) and others, the correction for continuity is not
required when combining data from several tables. If, therefore, we take a weighted mean
of the estimates $E_r$ with weights $1/V(E_r)$, the resultant test of significance may be expected
to be adequately accurate when the values of $\chi^2$ corrected for continuity give satisfactory
approximations to the one-tail probabilities of the distributions generated by the $2 \times 2$
tables. This will be the case in the present instance, since all expectations are reasonably
large and $n_{1r}$ and $n_{2r}$ differ by less than a factor of 2 in all experiments, and are very nearly
equal in total.

The calculations for using formula (2) for $V(E_r)$ are shown in Table 2. The weighted
mean of $E$ is

\[ \bar{E} = 44.182/153.85 = 0.2872, \]

with standard error $1/\sqrt{153.85} = 0.0806$. The level of significance (one tail) is therefore
$P = 0.000184$.


Table 2. Calculation of $\bar{E}$

Exp.          E            V(E)          w          wE
 1         +0.1559       0.05459       18.32       2.856
 2         +0.4421       0.04622       21.63       9.563
 3         +0.3786       0.01790       55.88      21.156
 4         -0.0489       0.05776       17.31      -0.846
 5         +0.0692       0.05099       19.61       1.357
 6         +0.4785       0.04739       21.10      10.096

       +0.2872 ± 0.0806               153.85      44.182

P = 0.000184.
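
The entries of Table 2 can be recomputed directly from the counts of Table 1. The following Python sketch is offered as an illustration under stated assumptions: the variable and function names are introduced here, the totals are taken as mutant plus normal counts, and small differences in the last digit from the printed table are to be expected from rounding.

```python
import math

# (mutants, total) for methods F and M in each experiment, from Table 1.
DATA = [
    ((29, 844), (45, 1531)),
    ((78, 2700), (27, 1465)),
    ((110, 3285), (100, 4381)),
    ((25, 1063), (52, 2105)),
    ((31, 370), (43, 550)),
    ((57, 600), (27, 463)),
]

def estimate_and_variance(f, m):
    """E_r and its variance from formula (2) for one 2 x 2 table."""
    (a1, n1), (a2, n2) = f, m
    p1, p2 = a1 / n1, a2 / n2
    p = (a1 + a2) / (n1 + n2)              # pooled estimate p_r
    e = (p1 - p2) / (0.5 * (p1 + p2))
    v = 4 * p * (1 - p) / (p1 + p2) ** 2 * (n1 + n2) / (n1 * n2)
    return e, v

rows = [estimate_and_variance(f, m) for f, m in DATA]
weights = [1 / v for _, v in rows]
sum_w = sum(weights)
e_bar = sum(w * e for w, (e, _) in zip(weights, rows)) / sum_w
std_err = 1 / math.sqrt(sum_w)
# One-tail normal probability for the ratio of e_bar to its standard error.
p_one_tail = 0.5 * math.erfc(e_bar / std_err / math.sqrt(2))
```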


The results of Table 2 also provide a simple test of the constancy of $(\theta_1-\theta_2)/\tfrac12(\theta_1+\theta_2)$
over the different experiments. All that is necessary is to calculate the weighted sum of
squares of deviations of $E$. This is

\[ S\,w_r(E_r-\bar{E})^2 = S\,w_rE_r^2 - \bar{E}\,S\,w_rE_r = 4.962. \]

If there is no real variation this will be distributed as $\chi^2$ with 5 d.f. This gives $P = 0.42$, so
there is no evidence of variation.
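
As a check on this arithmetic, the weighted sum of squares and its significance level may be computed from the printed values of Table 2. The sketch below is illustrative only; the names are assumptions made here, and the upper-tail probability uses the closed form that is available when the number of degrees of freedom is odd.

```python
import math

def chi2_sf_odd(x, df):
    """Upper-tail probability of chi-square for an odd number of d.f."""
    if df <= 0 or df % 2 == 0:
        raise ValueError("df must be a positive odd integer")
    total, term = 0.0, 1.0
    for j in range((df - 1) // 2):
        if j > 0:
            term *= x / (2 * j + 1)     # x^j / (1 * 3 * ... * (2j + 1))
        total += term
    tail = math.erfc(math.sqrt(x / 2))
    return tail + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total

# E_r and w_r as printed in Table 2.
E = [0.1559, 0.4421, 0.3786, -0.0489, 0.0692, 0.4785]
W = [18.32, 21.63, 55.88, 17.31, 19.61, 21.10]

e_bar = sum(w * e for w, e in zip(W, E)) / sum(W)
heterogeneity = sum(w * (e - e_bar) ** 2 for w, e in zip(W, E))
p_value = chi2_sf_odd(heterogeneity, len(E) - 1)
```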

The same procedure may be adopted to test whether there is any evidence of variation 
in the mutation rates from experiment to experiment, apart from those due to differences 
in irradiation rate. For this purpose it will be appropriate to test whether the estimates of 







$\tfrac12(\theta_1+\theta_2)$ are homogeneous. These estimates are given in Table 3, together with their
standard errors calculated in the conventional manner. The value of the weighted sum
of squares of deviations of these rates is 5.32 ($\chi^2$ with 5 d.f.) giving $P = 0.38$. There is thus
no evidence of any such variation. The postulated law also accounts satisfactorily for the
difference in mutation rates at the different levels of irradiation.


Table 3. Estimates of $\tfrac12(\theta_1+\theta_2) \times 10^4$

Exp.
 1                 0.332 ± 0.0396
 2                 0.246 ± 0.0248
 3                 0.293 ± 0.0201
 4                 0.251 ± 0.0300
 5                 0.270 ± 0.0306
 6                 0.256 ± 0.0270
Weighted mean      0.272 ± 0.0110














It may be noted that if the irradiation rate had been constant, and the difference between
$\theta_1$ and $\theta_2$ could be neglected, the ordinary $\chi^2$ test on the $2 \times 6$ contingency table of the values
of $n'_r$ and $n_r - n'_r$ could be used to test the constancy of the mutation rate from experiment
to experiment.

Since the difference between the $p_{1r}$ and $p_{2r}$ is demonstrated we might revise the
calculations by using formula (1) for $V(E_r)$. This, however, is merely a partial step in the direction
of the small sample solution by maximum likelihood, which will be outlined in the following
sections.

Cochran (1954) has adopted a similar procedure to that given above in order to provide
a test of significance for a set of $2 \times 2$ contingency tables. His test criterion is the weighted
mean of the quantities $p_{1r} - p_{2r}$. This is, in fact, equivalent to a weighted mean of the quantities

\[ E'_r = \frac{p_{1r}-p_{2r}}{p_rq_r}. \]

$E'_r$ differs from $E_r$ only in the substitution of the pooled estimate $p_r$ and the introduction
of the factor $q_r$, which gives a difference, analogous to logits, which may be expected to be
reasonably constant over a wide range of $p_r$. Regarded in this way, therefore, Cochran’s
analysis could easily be extended to provide a test for the constancy of the difference from
experiment to experiment.


MAXIMUM-LIKELIHOOD SOLUTION: CHOICE OF TRANSFORMATION


The maximum-likelihood solution is most easily obtained by the use of a suitable trans- 
formation, as, for example, the probit transformation, which has become familiar in 
toxicology and biological assay. The maximum-likelihood solution can then be obtained by 
successive approximation. The choice of transformation depends on the mathematical
specification (the ‘model’) which is considered most appropriate to the data under 













investigation. When this specification has been decided the transformation chosen should be 
such that the quantities that require estimation are simple functions of the transformed 
variate; thus in probit work it is anticipated that the transformed variate will bear a linear 
relation to the dosage when the latter is expressed in suitable units (usually logarithms), and 
the quantities that require estimation are the constant term and the slope of this regression. 
A secondary requirement is that in no part of the range (or at least in no part covered by the 
data) do the working values or the weights associated with the transformed variate become 
infinite. 
In the genetical example given above the transformation

\[ y = \log(-\log q) \]

is obviously suitable. For with this transformation if $Y_{1r}$ and $Y_{2r}$ represent the transformed
values of $\pi_{1r}$ and $\pi_{2r}$ we have

\[ Y_{1r} = \log\lambda_r + \log\mu_r + \log\theta_1, \]
\[ Y_{2r} = \log\lambda_r + \log\mu_r + \log\theta_2. \]

The effects of the treatments and of the variations in sensitivity are therefore represented
by additive components in the transformed scale. This transformation, which is the same
as the ordinary log log transformation with $p$ replaced by $q$, may be termed the
complementary log log transformation.

In many cases in which there is no clearly appropriate theoretical model, and in which
there is no reason to differentiate between the two ends of the probability scale, the logit
transformation

\[ z = \tfrac12\log(p/q) \]

is likely to be suitable. When the $p$’s are all small the logit transformation is equivalent to
one-half the complementary log log transformation, and when the $p$’s are all nearly unity
the logit transformation is equivalent to one-half the ordinary log log transformation.*
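
The relation between these transformations at the two ends of the scale can be seen numerically. The following Python sketch is illustrative only; the function names and the two trial proportions are assumptions made here. Under the definitions used in the sketch the agreement near unity is in magnitude, the sign being reversed.

```python
import math

def cloglog(p):
    """Complementary log log transformation y = log(-log q), with q = 1 - p."""
    return math.log(-math.log1p(-p))

def loglog(p):
    """Ordinary log log transformation: the same form with p in place of q."""
    return math.log(-math.log(p))

def logit(p):
    """Logit as defined in the text, z = (1/2) log(p/q)."""
    return 0.5 * math.log(p / (1.0 - p))

small, large = 0.001, 0.999   # hypothetical proportions near the two ends

# Twice the logit is close to the complementary log log value when p is small,
gap_small = 2 * logit(small) - cloglog(small)
# and close to minus the ordinary log log value when p is nearly unity.
gap_large = 2 * logit(large) + loglog(large)
```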

If the logit transformation is used the analysis will be conducted on the assumption that 
the treatment and sensitivity effects are additive on the transformed scale, i.e. that, except 
for random variations, the difference in the logits for the two treatments is constant from 
experiment to experiment. 

In certain circumstances the probit transformation may be considered to be more appro- 
priate than the logit transformation. Except at the extreme ends of the scale, however, 
there is little difference between them, and in the absence of any theoretical reasons for 
choosing one or the other it will rarely be possible to determine, by reference to the data 
themselves, which specification gives a better representation of the effects under investiga- 
tion. 


MAXIMUM-LIKELIHOOD EQUATIONS 


As has been demonstrated in a number of places, the general procedure of obtaining the
maximum-likelihood solution when using a transformation of the form $y = \phi(p)$ is as follows:

(1) Determine a set of $N$ provisional values $Y$ of the transformed variate for the $N$
observed proportions. These can conveniently be deduced from preliminary (here called first)
estimates of the required parameters or, when regression lines are involved, graphically.


* In the author’s Sampling Methods (2nd edition, 1953) the logit transformation is defined as 
z = log p/q. This does not accord with modern usage, and was due to an over-hasty reference to 
Finney’s Statistical Method in Biological Assay (1952). 







In many cases it pays, when making the first estimate, to use weights similar to those used 
in the further steps of the analysis. 

(2) For each observed proportion $p$ (given by say $n'$ successes out of $n$ observations)
calculate a working value given (when $y$ increases uniformly with $p$) by

\[ y = pY_{\max} + qY_{\min} = Y_{\max} - q(\text{Range}) = Y_{\min} + p(\text{Range}), \]

where $Y_{\max}$ and $Y_{\min}$ are the maximum and minimum working values corresponding to
the assigned provisional value $Y$, and $Y_{\max} - Y_{\min} = \text{Range}$. Determine also the weight
$W$ to be assigned to the working value. This is given by $nw$, where $w$ is the weighting
coefficient. $Y_{\max}$, $Y_{\min}$ and $w$ are functions of $Y$ (or of the corresponding untransformed
proportion $P$) whose form depends on the transformation. The general formulae are

\[ Y_{\max} = Y + Q\frac{dY}{dP}, \qquad Y_{\min} = Y - P\frac{dY}{dP}, \qquad \text{Range} = \frac{dY}{dP}, \qquad w = \frac{1}{PQ}\Big/\left(\frac{dY}{dP}\right)^2. \]

For the transformations commonly in use tables of these functions are readily available.

If $y$ decreases uniformly with increasing $p$ the formulae for $Y_{\max}$ and $Y_{\min}$ are
interchanged and $p$ and $q$ are interchanged in the formulae for the working value $y$.

(3) Using the working values as observed variates obtain revised estimates of the para- 
meters by the ordinary method of least squares, weighting each observation by the weight 
determined as above. 

(4) From the revised estimates of the parameters determine a new set of provisional 
values, and repeat the whole process as often as is necessary. 
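
Where tables of the working functions are not to hand, step (2) can be carried out directly from the general formulae. The sketch below does this for the transformation $y = \log(-\log q)$; it is an illustration under assumptions, the function name `cloglog_working` being introduced here, and the numerical case being experiment 1 of the Drosophila data (method F, 29 mutants out of 844) with provisional value $-3.28$. Since no tabular interpolation is involved, the results agree with tabulated working values only to within rounding.

```python
import math

def cloglog_working(n, successes, Y):
    """Working value y and weight W = n*w for y = log(-log q) at provisional Y."""
    Q = math.exp(-math.exp(Y))        # expected q implied by the provisional value
    P = 1.0 - Q
    rng = math.exp(-Y) / Q            # Range = dY/dP = 1 / (Q * (-log Q))
    y_max = Y + Q * rng
    y_min = Y - P * rng
    p = successes / n
    y = p * y_max + (1.0 - p) * y_min  # equals Y + (p - P) * Range
    w = 1.0 / (P * Q * rng ** 2)       # weighting coefficient
    return y, n * w

y, W = cloglog_working(844, 29, -3.28)
```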

In the case of the genetic example already discussed in large sample terms we may write

\[ \log\theta_1 = t_1, \qquad \log\theta_2 = t_2, \]
\[ \log\lambda_r = l_r, \qquad \log\mu_r = m_r \qquad (r = 1, \dots, k). \]

$t_1$, $t_2$ and the $m_r$ are the parameters that have to be estimated.

If $t'_1$, $t'_2$ and $m'_r$ are first estimates of these parameters provisional values are given by
the equations

\[ Y_{1r} = l_r + m'_r + t'_1, \qquad Y_{2r} = l_r + m'_r + t'_2. \tag{3} \]

Experiment $r$ gives the two observation equations

\[ t_1 + m_r = y_{1r} - l_r + \epsilon, \qquad \text{weight } W_{1r}, \]
\[ t_2 + m_r = y_{2r} - l_r + \epsilon, \qquad \text{weight } W_{2r}, \]

there being $k$ such pairs of equations in all.
The ordinary method of least squares then gives the normal equations:

\[ t_1\Sigma W_{1r} + \Sigma m_rW_{1r} = \Sigma W_{1r}(y_{1r}-l_r), \]
\[ t_2\Sigma W_{2r} + \Sigma m_rW_{2r} = \Sigma W_{2r}(y_{2r}-l_r), \]
\[ m_r(W_{1r}+W_{2r}) + t_1W_{1r} + t_2W_{2r} = W_{1r}(y_{1r}-l_r) + W_{2r}(y_{2r}-l_r) \qquad (r = 1, \dots, k), \]

where $\Sigma$ denotes summation over $r$ from 1 to $k$ and $t_1$, $t_2$ and the $m_r$ now denote estimates.
These equations are of the same form as equation (1) of Jolly’s paper.











Since one of these equations is redundant we may conveniently put

\[ \Sigma m_r(W_{1r}+W_{2r}) = 0. \tag{4} \]

The sum of the first two equations then gives

\[ t_1\Sigma W_{1r} + t_2\Sigma W_{2r} = \Sigma W_{1r}y_{1r} + \Sigma W_{2r}y_{2r} - \Sigma(W_{1r}+W_{2r})\,l_r. \tag{5} \]

Substituting the values of $m_r$ given by the last $k$ equations in either of the first two equations,
we obtain

\[ (t_1-t_2)\,\Sigma\frac{W_{1r}W_{2r}}{W_{1r}+W_{2r}} = \Sigma\frac{W_{1r}W_{2r}}{W_{1r}+W_{2r}}\,(y_{1r}-y_{2r}). \tag{6} \]

Thus $t_1 - t_2$ is a weighted mean of $y_{1r} - y_{2r}$, with weights $W_{1r}W_{2r}/(W_{1r}+W_{2r})$.
To obtain $m_r$ equations (5) and (6) may be solved for $t_1$ and $t_2$. We then have

\[ m_r + l_r = \frac{1}{W_{1r}+W_{2r}}\{W_{1r}(y_{1r}-t_1) + W_{2r}(y_{2r}-t_2)\}. \tag{7} \]

When the logit transformation is used and the treatment and sensitivity effects are
assumed additive on the logit scale, the solution is identical with that given above, except
for the omission of the $l$ terms.
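
Equations (5), (6) and (7) give the solution of the normal equations in closed form, so that one cycle of the least-squares step needs only a few sums. The Python sketch below illustrates this; the function name `solve_cycle` is an assumption made here, and the trial figures are the first-estimate working values and weights of the Drosophila example (Table 4), with $l_r = \log 960$ or $\log 3000$.

```python
def solve_cycle(y1, W1, y2, W2, l):
    """Solve the normal equations via (5), (6) and (7); returns t1, t2 and the m_r."""
    u = [a * b / (a + b) for a, b in zip(W1, W2)]
    # Equation (6): t1 - t2 is a weighted mean of y1 - y2.
    diff = sum(ui * (a - b) for ui, a, b in zip(u, y1, y2)) / sum(u)
    A, B = sum(W1), sum(W2)
    # Equation (5): A*t1 + B*t2 equals the numerical term rhs.
    rhs = (sum(w * (y - lr) for w, y, lr in zip(W1, y1, l))
           + sum(w * (y - lr) for w, y, lr in zip(W2, y2, l)))
    t1 = (rhs + B * diff) / (A + B)
    t2 = t1 - diff
    # Equation (7), with l_r subtracted to leave m_r.
    m = [(a * (ya - t1) + b * (yb - t2)) / (a + b) - lr
         for a, b, ya, yb, lr in zip(W1, W2, y1, y2, l)]
    return t1, t2, m

y1 = [-3.37, -3.53, -3.37, -3.72, -2.43, -2.30]
W1 = [28, 81, 108, 26, 32, 57]
y2 = [-3.53, -4.01, -3.76, -3.68, -2.51, -2.82]
W2 = [46, 26, 97, 51, 43, 27]
l = [6.8669] * 4 + [8.0064] * 2

t1, t2, m = solve_cycle(y1, W1, y2, W2, l)
```

The $m_r$ returned satisfy condition (4) automatically, which provides a check on the arithmetic.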


APPLICATION TO DROSOPHILA DATA 


The computations leading to the normal equations are set out in Table 4. The first estimates
have been obtained by taking values of $y$ corresponding to the observed $p$, and weighting
coefficients corresponding to these $y$. Thus from Table XVI of Finney (1952) the $y$ value
corresponding to $1 - 0.0344$ is $-3.37$, and from Table XVII the corresponding weighting
coefficient is 0.033, so that $W = 0.033 \times 844 = 28$. The values of $y_1 - y_2$, $W_1 + W_2$ and
$W_1W_2/(W_1+W_2)$ are then calculated. These last are used as weights in calculating the weighted
mean of $y_1 - y_2$, namely, $+0.2990$. The equations for $t'_1$ and $t'_2$ are therefore

\[ 332\,t'_1 + 290\,t'_2 = -6505.33, \]
\[ t'_1 - t'_2 = +0.2990. \]

The numerical term of the first equation is obtained from the sum of products of the $y$ and
$W$ columns less those of the $l$ and $W_1 + W_2$ columns.
Solution in the ordinary manner gives

\[ t'_1 = -10.3193, \qquad t'_2 = -10.6183. \]

Equations (7) then give the values of $m'_r + l_r$ shown in Table 4.
Provisional values $Y_{1r}$ and $Y_{2r}$ are now calculated from equations (3). Thus

\[ Y_{11} = -10.3193 + 7.0357 = -3.28. \]


The working values and weights are calculated as explained above, using the appropriate 
table (Table XVII of Finney). It may be noted that if, as here, interpolation is required 
it is easiest to calculate two values of y and interpolate between these. 


The remainder of the calculations proceed exactly as before, and lead to the second 
estimates shown in Table 4. 
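
The whole cycle of provisional values, working values, weights and least-squares solution can be repeated mechanically. The following Python sketch is an illustration under assumptions: all names are introduced here, the working values and weights are computed from the general formulae rather than read from Finney's tables, and the first estimates use the directly transformed proportions. Its results should therefore agree with the tabulated computations only to within rounding.

```python
import math

# (mutants, total) per experiment for methods F and M, and log doses l_r.
F = [(29, 844), (78, 2700), (110, 3285), (25, 1063), (31, 370), (57, 600)]
M = [(45, 1531), (27, 1465), (100, 4381), (52, 2105), (43, 550), (27, 463)]
L = [math.log(d) for d in (960, 960, 960, 960, 3000, 3000)]

def working(counts, Y):
    """Working values and weights for y = log(-log q) at provisional values Y."""
    ys, Ws = [], []
    for (a, n), Yr in zip(counts, Y):
        Q = math.exp(-math.exp(Yr))
        rng = math.exp(-Yr) / Q                       # dY/dP
        ys.append(Yr + (a / n - (1 - Q)) * rng)
        Ws.append(n / ((1 - Q) * Q * rng ** 2))
    return ys, Ws

def solve(y1, W1, y2, W2):
    """Least-squares solution for t1, t2 and m_r + l_r, equations (5) to (7)."""
    u = [a * b / (a + b) for a, b in zip(W1, W2)]
    d = sum(ui * (a - b) for ui, a, b in zip(u, y1, y2)) / sum(u)
    rhs = sum(w * (y - lr) for w, y, lr in zip(W1 + W2, y1 + y2, L + L))
    t1 = (rhs + sum(W2) * d) / (sum(W1) + sum(W2))
    t2 = t1 - d
    ml = [(a * (ya - t1) + b * (yb - t2)) / (a + b)
          for a, b, ya, yb in zip(W1, W2, y1, y2)]
    return t1, t2, ml

def fit(cycles=6):
    # First estimates: transform the observed proportions directly.
    Y1 = [math.log(-math.log(1 - a / n)) for a, n in F]
    Y2 = [math.log(-math.log(1 - a / n)) for a, n in M]
    for _ in range(cycles):
        y1, W1 = working(F, Y1)
        y2, W2 = working(M, Y2)
        t1, t2, ml = solve(y1, W1, y2, W2)
        Y1 = [t1 + v for v in ml]                     # equations (3)
        Y2 = [t2 + v for v in ml]
    return t1, t2

t1, t2 = fit()
theta_ratio = math.exp(t1 - t2)                       # estimate of theta_1 / theta_2
```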


Table 4. Analysis of Drosophila data

           Basic data          1st estimates             2nd estimates
Exp.     n_1     100p_1        y_1       W_1        Y_1       y_1       W_1
 1       844      3.44        -3.37       28       -3.28    -3.349      31.2
 2      2700      2.89        -3.53       81       -3.57    -3.528      75.0
 3      3285      3.35        -3.37      108       -3.41    -3.378     106.8
 4      1063      2.35        -3.72       26       -3.49    -3.711      31.9
 5       370      8.38        -2.43       32       -2.30    -2.428      35.3
 6       600      9.50        -2.30       57       -2.37    -2.301      53.6
                                         332                           333.8

Exp.     n_2     100p_2        y_2       W_2        Y_2       y_2       W_2
 1      1531      2.94        -3.53       46       -3.58    -3.509      42.1
 2      1465      1.84        -4.01       26       -3.87    -3.979      30.3
 3      4381      2.28        -3.76       97       -3.71    -3.767     106.0
 4      2105      2.47        -3.68       51       -3.79    -3.683      47.0
 5       550      7.82        -2.51       43       -2.60    -2.504      39.4
 6       463      5.83        -2.82       27       -2.67    -2.803      31.0
                                         290                           295.8

In the block below, U_r denotes W_1 W_2/(W_1 + W_2).

                          1st estimates                              2nd estimates
Exp.     l_r     y_1-y_2   W_1+W_2    U_r    m'_r+l_r     y_1-y_2   W_1+W_2    U_r    m_r+l_r      m_r
 1     6.8669     +0.16       74      17.4    7.0357       +0.160     73.3    17.92    7.0543    +0.1874
 2     6.8669     +0.48      107      19.7    6.7453       +0.451    105.3    21.58    6.7532    -0.1137
 3     6.8669     +0.39      205      51.1    6.9062       +0.389    212.8    53.20    6.9010    +0.0341
 4     6.8669     -0.04       77      17.2    6.8238       -0.028     78.9    19.00    6.8071    -0.0598
 5     8.0064     +0.08       75      18.3    8.0149       +0.076     74.7    18.62    8.0134    +0.0070
 6     8.0064     +0.52       84      18.3    7.9483       +0.502     84.6    19.64    7.9493    -0.0571
                 +0.2990     622     142.0                 +0.2937   629.6   149.96

t_1                                         -10.3193                                  -10.3265
t_2                                         -10.6183                                  -10.6202
From the small changes in the values of $y_1$ and $y_2$ and the still smaller change (from
$+0.2990$ to $+0.2937$) in $t_1 - t_2$, which is trivial compared with the range of the values of
which it is the weighted mean, it is clear that no further approximation is required.
(A further cycle actually gives a value of $+0.29341$ for $t_1 - t_2$.)
The advantage of deriving the first estimates by weighting the directly transformed
proportions will now be apparent. The computations follow the same routine in each case
(except for the calculation of working values) and a very satisfactory first approximation is


obtained. Moreover, the similarity of the figures in the two sets of computations provides 
a useful check against gross errors.* 

The maximum-likelihood solution may be compared with the large-sample solution
already obtained. For comparative purposes a correction to $E$ is required to allow for the
fact that squares of $\lambda\theta_1$ and $\lambda\theta_2$ have been neglected in the large-sample solution. This is
obtained by multiplying each component of $E$ by the appropriate value of $[1 + \tfrac14(p_1 + p_2)]$.
With this adjustment, and the corresponding adjustment in the weights, the large-sample
estimate becomes 0.2930, with weight 147.8. The maximum-likelihood solution gives
$\theta_1/\theta_2 = 1.3414$, so that $(\theta_1-\theta_2)/\tfrac12(\theta_1+\theta_2) = 0.2916$. The agreement between the two methods
of estimation is thus very close, the discrepancy being less than 1/50 of the standard error.


ESTIMATION OF ERROR AND TESTS OF SIGNIFICANCE 


Three types of test will commonly be required: 

(1) A test of the difference between the two treatments, i.e. a test of whether $t_1 - t_2$ is
significantly different from zero. Whether or not the difference is significant the estimated
standard error of $t_1 - t_2$ should be given so that approximate fiducial limits can be assigned.

(2) A test whether there is any real variation in sensitivity of experiments, i.e. whether
the differences between the $m_r$ are significant.

(3) A test whether the residual variation is greater than can be accounted for by binomial 
sampling. 

In the analysis of normally distributed data with variances in known ratios, i.e. with 
known relative weights, any set of parameters of which estimates have been obtained by 
the method of least squares can be tested for significance by finding the difference in the 
sum of squares accounted for when all parameters are included in the specification and that 
accounted for when the parameters under test are omitted from the specification. This 
difference is tested against the residual sum of squares by means of the z distribution. For 
the corresponding $\chi^2$ test on quantal material, however, the relevant $\chi^2$ cannot be exactly
determined in this way, since changes in specification result in changes in the expectations, 
which in turn alter the expected variances. 

* In general we may expect to save at least a cycle of iteration by initial weighting. In Dyke and
Patterson’s example, for instance, the first estimates obtained in Sampling Methods by weighting are
fully as accurate as Dyke & Patterson’s second estimates.

A simple example will illustrate this point. Suppose we have two sets of $\tfrac12n$ observations
($n$ even), which are known to be derived from Poisson distributions whose means $\mu_1$ and $\mu_2$
are possibly different. If $\mu_1 = \mu_2$ the value of $\chi^2$, $\chi^2_T$ say, based on the whole of the data will
be approximately distributed as $\chi^2$ with $n - 1$ d.f. Likewise the value of $\chi^2$, $\chi^2_D$ say,
calculated from the difference of the sums of the two sets of observations, will be distributed
as $\chi^2$ with 1 d.f. Whether or not $\mu_1 = \mu_2$, variation of the observations within sets will give
two values of $\chi^2$, $\chi^2_1$ and $\chi^2_2$ say, each of which will be distributed as $\chi^2$ with $\tfrac12n - 1$ d.f. It can
easily be shown that

\[ \chi^2_T = \chi^2_D + \frac{\bar{x}_1}{\bar{x}}\chi^2_1 + \frac{\bar{x}_2}{\bar{x}}\chi^2_2
           = \chi^2_D + \chi^2_1 + \chi^2_2 + \frac{\bar{x}_1-\bar{x}_2}{2\bar{x}}(\chi^2_1-\chi^2_2), \]

where $\bar{x}_1$ and $\bar{x}_2$ are the observed means of the two sets and $\bar{x} = \tfrac12(\bar{x}_1+\bar{x}_2)$. If instead of
$\chi^2_D$ the difference between the total and residual $\chi^2$, i.e. $\chi^2_T - \chi^2_1 - \chi^2_2$, is taken as the criterion
for testing the hypothesis $\mu_1 = \mu_2$ the discrepancy given by the last term of the above
equation will be obtained.
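
The decomposition can be verified numerically on any two sets of counts of equal size. The Python sketch below is illustrative only: the function names and the two sets of counts are assumptions made here, and the labels total, difference and within-set are simply descriptive names for the four values of $\chi^2$ discussed above.

```python
def chi2_within(xs):
    """Sum of squared deviations from the mean, divided by the mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / mean

def decomposition(set1, set2):
    """Return total, difference and within-set chi-squares and the discrepancy term."""
    if len(set1) != len(set2):
        raise ValueError("the two sets must be of equal size")
    m1 = sum(set1) / len(set1)
    m2 = sum(set2) / len(set2)
    mean = 0.5 * (m1 + m2)
    chi_total = chi2_within(set1 + set2)
    chi_diff = (sum(set1) - sum(set2)) ** 2 / (sum(set1) + sum(set2))
    chi_1, chi_2 = chi2_within(set1), chi2_within(set2)
    extra = (m1 - m2) / (2 * mean) * (chi_1 - chi_2)
    return chi_total, chi_diff, chi_1, chi_2, extra

# Hypothetical counts: two sets of four observations (n = 8).
chi_total, chi_diff, chi_1, chi_2, extra = decomposition([3, 5, 4, 8], [9, 12, 7, 10])
```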

It is therefore advisable to use direct tests for each set of parameters, instead of relying
on differences between $\chi^2$ values. The general rule for obtaining the variances of the
estimates of parameters in maximum-likelihood solutions of this type is to invert the
matrix of the coefficients of the final set of normal equations, thus obtaining a $c$ matrix of
the customary type. The variances and covariances are then given directly by the $c$ values.

In the present case the fact that one of the constants is redundant introduces a slight
additional complication. The procedure has been given by Yates & Hale (1939). Suppose
that the $n$ normal equations $E_1 = 0$, $E_2 = 0$, ..., $E_n = 0$ are governed by the identical
relationship

\[ \lambda_1E_1 + \lambda_2E_2 + \dots + \lambda_nE_n = 0, \tag{8} \]

and that the relation

\[ \mu_1b_1 + \mu_2b_2 + \dots + \mu_nb_n = 0 \tag{9} \]

is assumed between the estimates of the parameters. Then instead of replacing the numerical
terms of the normal equations by 1, 0, ..., 0 in order to obtain the equations for $c_{11}, c_{12}, \dots, c_{1n}$,
they must be replaced by

\[ 1 - \frac{\lambda_1\mu_1}{S(\lambda\mu)}, \qquad -\frac{\lambda_1\mu_2}{S(\lambda\mu)}, \qquad \dots, \qquad -\frac{\lambda_1\mu_n}{S(\lambda\mu)}. \]

Similarly the equations for $c_{21}, c_{22}, \dots, c_{2n}$ will be obtained by replacing the numerical
terms by

\[ -\frac{\lambda_2\mu_1}{S(\lambda\mu)}, \qquad 1 - \frac{\lambda_2\mu_2}{S(\lambda\mu)}, \qquad \dots, \qquad -\frac{\lambda_2\mu_n}{S(\lambda\mu)}, \]

etc.*
Designating the normal equations of our example by the suffices $a, b, 1, 2, \dots, k$, we have

\[ -E_a - E_b + E_1 + E_2 + \dots + E_k = 0, \]

so that $\lambda_a = \lambda_b = -1$, $\lambda_1 = \lambda_2 = \dots = \lambda_k = 1$.
Also from (4) $\mu_a = \mu_b = 0$, $\mu_r = W_{1r} + W_{2r}$ $(r = 1, \dots, k)$.

Consequently $S(\lambda\mu) = \Sigma(W_{1r}+W_{2r})$. The correction terms to be added to each of the first two
sets of auxiliary equations are therefore

\[ 0, \quad 0, \quad g_1, \quad g_2, \quad \dots, \quad g_k, \]

where $g_r = (W_{1r}+W_{2r})/\Sigma(W_{1r}+W_{2r})$. For the remaining $k$ sets the same correction terms are
subtracted.


Tests for the t’s 


The c’s can now be easily evaluated. All that is necessary is to make the appropriate 
changes in the numerical terms of the normal equations and utilize the solutions already 
obtained. 


* The process can be extended to two (or more) redundant constants by introducing additional 
correction terms —/,Aj/S(y’A’), ete., provided that all the sums of products of the type S(Ay’), 
S(A’), etc., are zero. The Yates & Hale paper is incorrect in that it omits this condition, and also assigns 
a numerical term //, instead of zero to equation (9). 






















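Putting $U = \Sigma\dfrac{W_{1r}W_{2r}}{W_{1r}+W_{2r}}$, we find

\[ c_{aa} = V(t_1) = \frac{1}{\Sigma(W_{1r}+W_{2r})} + \frac{(\Sigma W_{2r})^2}{U\{\Sigma(W_{1r}+W_{2r})\}^2}, \]
\[ c_{bb} = V(t_2) = \frac{1}{\Sigma(W_{1r}+W_{2r})} + \frac{(\Sigma W_{1r})^2}{U\{\Sigma(W_{1r}+W_{2r})\}^2}, \]
\[ c_{ab} = \operatorname{cov}(t_1, t_2) = \frac{1}{\Sigma(W_{1r}+W_{2r})} - \frac{\Sigma W_{1r}\,\Sigma W_{2r}}{U\{\Sigma(W_{1r}+W_{2r})\}^2}. \]

Hence

\[ V(t_1-t_2) = \frac{1}{U} = 1\Big/\Sigma\frac{W_{1r}W_{2r}}{W_{1r}+W_{2r}}, \]

as could, indeed, be deduced directly from the form of the expression for $t_1 - t_2$, and

\[ V\{\tfrac12(t_1+t_2)\} = \frac{1}{\Sigma(W_{1r}+W_{2r})} + \frac{(\Sigma W_{1r}-\Sigma W_{2r})^2}{4U\{\Sigma(W_{1r}+W_{2r})\}^2}. \]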
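
These expressions are readily evaluated once the final weights are available. The sketch below is illustrative only; the function name `t_variances` is an assumption made here, and the weights are those used for the second estimates of the Drosophila example in Table 4.

```python
import math

def t_variances(W1, W2):
    """V(t1), V(t2), V(t1 - t2) and V((t1 + t2)/2) from the final weights."""
    A, B = sum(W1), sum(W2)
    S = A + B
    U = sum(a * b / (a + b) for a, b in zip(W1, W2))
    v1 = 1 / S + B ** 2 / (U * S ** 2)
    v2 = 1 / S + A ** 2 / (U * S ** 2)
    cov = 1 / S - A * B / (U * S ** 2)
    return {
        "t1": v1,
        "t2": v2,
        "diff": v1 + v2 - 2 * cov,          # reduces to 1/U
        "mean": (v1 + v2 + 2 * cov) / 4,
    }

W1 = [31.2, 75.0, 106.8, 31.9, 35.3, 53.6]
W2 = [42.1, 30.3, 106.0, 47.0, 39.4, 31.0]
std_errors = {name: math.sqrt(v) for name, v in t_variances(W1, W2).items()}
```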


Test for the m’s 


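Using a similar procedure for set $r$, and putting

\[ T_r = \frac{W_{1r}\,\Sigma W_{2r} - W_{2r}\,\Sigma W_{1r}}{U(W_{1r}+W_{2r})\,\Sigma(W_{1r}+W_{2r})}, \]

we find

\[ c_{rr} = V(m_r) = \frac{1}{W_{1r}+W_{2r}} - \frac{1}{\Sigma(W_{1r}+W_{2r})} + UT_r^2, \]
\[ c_{rs} = \operatorname{cov}(m_r, m_s) = -\frac{1}{\Sigma(W_{1r}+W_{2r})} + UT_rT_s. \]

This gives

\[ V(m_r-m_s) = \frac{1}{W_{1r}+W_{2r}} + \frac{1}{W_{1s}+W_{2s}} + U(T_r-T_s)^2, \]

which, with a certain amount of algebra, can be shown, when there are only two experiments,
to be equal to

\[ 1\Big/\left[\frac{W_{1r}W_{1s}}{W_{1r}+W_{1s}} + \frac{W_{2r}W_{2s}}{W_{2r}+W_{2s}}\right]. \]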

This provides a check, since it corresponds to the expression for $V(t_1 - t_2)$ for two
experiments, with interchange of treatments and experiments.

Apart from the $T$ terms the variances and covariances of the $m$’s are those of quantities
$M_r - \bar{M}$ where the $M_r$ are independently distributed with variances $1/(W_{1r}+W_{2r})$ and $\bar{M}$ is
the weighted mean of the $M_r$. If the true values of the $m$’s are all zero, therefore, the
weighted sum of squares

\[ \Sigma(W_{1r}+W_{2r})\,m_r^2 \]

will be distributed as $\chi^2$ with $k - 1$ d.f., apart from the disturbance due to the $T$ terms. The
weighted sum of squares has the expectation

\[ \Sigma(W_{1r}+W_{2r})\,V(m_r) = k - 1 + U\,\Sigma T_r^2(W_{1r}+W_{2r}). \]

If the $T$ terms are not sufficiently small to be neglected entirely the value of the second term
may be deducted before making the $\chi^2$ test, but this correction will usually be small. Since
the second term is always positive it need not in any case be calculated if significance is not
attained before taking it into account.
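
The weighted sum of squares and the contribution of the $T$ terms to its expectation may be computed as follows. The sketch is illustrative only; the function name `sensitivity_test` is an assumption made here, and the weights and values of $m_r$ are the second estimates of the Drosophila example in Table 4.

```python
def sensitivity_test(W1, W2, m):
    """Weighted sum of squares of the m_r and the T-term addition to its expectation."""
    A, B = sum(W1), sum(W2)
    S = A + B
    U = sum(a * b / (a + b) for a, b in zip(W1, W2))
    T = [(a * B - b * A) / (U * (a + b) * S) for a, b in zip(W1, W2)]
    ssq = sum((a + b) * mr ** 2 for a, b, mr in zip(W1, W2, m))
    extra = U * sum(t ** 2 * (a + b) for t, a, b in zip(T, W1, W2))
    return ssq, extra

W1 = [31.2, 75.0, 106.8, 31.9, 35.3, 53.6]
W2 = [42.1, 30.3, 106.0, 47.0, 39.4, 31.0]
m = [0.1874, -0.1137, 0.0341, -0.0598, 0.0070, -0.0571]

ssq, extra = sensitivity_test(W1, W2, m)   # ssq is referred to chi-square with k - 1 d.f.
```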
















Residual $\chi^2$

The general formula for the residual $\chi^2$ is the sum over all cells of

\[ \frac{(\text{observed no.} - \text{expected no.})^2}{\text{expected no.}} \]

If, as in the present case, the data consist of a set of observed proportions, this reduces to

\[ S\left\{\frac{(a-m)^2}{m} + \frac{(a-m)^2}{n-m}\right\} = S\left\{\frac{n(p-P)^2}{PQ}\right\}, \]

when $a/n = p$ is the observed proportion, $m/n = P = 1 - Q$ the expected proportion given
by the maximum-likelihood solution, and $S$ denotes summation over all proportions.

We may denote the provisional and working values and weights used for the $s$th estimates
by $Y_s$, $y_s$ and $W_s$. The provisional values $Y_s$ will equal the expected values derived from the
$(s-1)$th estimates. We have

\[ y_s = Y_s + (p - P_s)\left(\frac{dY}{dP}\right)_s. \]

Consequently the residual $\chi^2$ based on the $(s-1)$th estimates is given by

\[ (\chi^2)_{s-1} = S\left\{\frac{n(y_s-Y_s)^2}{P_sQ_s}\Big/\left(\frac{dY}{dP}\right)_s^2\right\} = S\,W_s(y_s-Y_s)^2. \tag{10} \]

The residual $\chi^2$ obtained from the expected values derived from the $s$th estimates is
given by

\[ (\chi^2)_s = S\{W_{s+1}(y_{s+1}-Y_{s+1})^2\}. \tag{11} \]

If the estimation process is stopped after the $s$th estimates it will be necessary to calculate
specially $Y_{s+1}$, $y_{s+1}$ and $W_{s+1}$ to evaluate $(\chi^2)_s$. Usually, however, $(\chi^2)_{s-1}$ will be sufficiently
accurate.
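
The equivalence of the two forms of the residual $\chi^2$ can be checked numerically. The Python sketch below is illustrative only: the function name is an assumption made here, the transformation is taken to be $y = \log(-\log q)$, and the provisional values are those shown for the second estimates of the Drosophila example in Table 4, rounded to two decimals, so that the total agrees with hand computation only approximately.

```python
import math

def residual_chi2(counts, Y):
    """Residual chi-square computed as S n(p-P)^2/(PQ) and as S W(y-Y)^2."""
    direct = via_working = 0.0
    for (a, n), Yr in zip(counts, Y):
        Q = math.exp(-math.exp(Yr))       # expected q for y = log(-log q)
        P = 1 - Q
        dYdP = math.exp(-Yr) / Q
        p = a / n
        y = Yr + (p - P) * dYdP           # working value
        W = n / (P * Q * dYdP ** 2)       # weight
        direct += n * (p - P) ** 2 / (P * Q)
        via_working += W * (y - Yr) ** 2
    return direct, via_working

counts = [(29, 844), (78, 2700), (110, 3285), (25, 1063), (31, 370), (57, 600),
          (45, 1531), (27, 1465), (100, 4381), (52, 2105), (43, 550), (27, 463)]
Y = [-3.28, -3.57, -3.41, -3.49, -2.30, -2.37,
     -3.58, -3.87, -3.71, -3.79, -2.60, -2.67]

direct, via_working = residual_chi2(counts, Y)
```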

Finney (1952) recommends the use of the ordinary formula for the sum of the squares of
the residuals when regression coefficients or constants $b_1, b_2, \dots$ are fitted by least squares.
This is

\[ SW(y-Y)^2 = SWy^2 - b_1B_1 - b_2B_2 - \dots, \]

where $B_1, B_2, \dots$ are the numerical terms of the normal equations with leading terms $b_1, b_2, \dots$.
A little consideration will show that if the $s$th estimates of the parameters and the numerical
terms of the $s$th set of normal equations, together with $y_s$ and $W_s$, are used to calculate the
right-hand side of the above equation, we shall obtain exactly

\[ SW_s(y_s-Y_{s+1})^2. \]

This, however, is not likely to give a substantially closer approximation than expression
(10). The latter is easier to compute, and also has the advantage of revealing aberrant
values.

When some of the expectations are very small it is well known that the $\chi^2$ test is
unreliable. In particular, positive deviations from very small expectations will make
excessively large contributions to the residual $\chi^2$. Under such circumstances Fisher (1950) has
shown that $2S\{a\log_e(a/m)\}$, tested against the ordinary $\chi^2$ distribution with the customary
number of degrees of freedom, gives a much better measure of the significance of the
deviations, and is well fitted to take the place of $\chi^2$ when the expectations are small. An










example of its use is given in example (c) below. This criterion requires somewhat more
computation for its evaluation than $\chi^2$, since it cannot be simply derived from the
transformed values. In our notation the contribution of each observed proportion is

\[ 2n\{p(\log_e p - \log_e P) + q(\log_e q - \log_e Q)\}. \]
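
The contribution of a single observed proportion to this criterion can be compared with its contribution to $\chi^2$ as follows. The sketch is illustrative only; the function names and the trial cell (3 successes in 200 trials against an expectation of one-half) are assumptions made here. A term with zero observed count is taken to contribute nothing to the sum.

```python
import math

def lr_contribution(a, n, P):
    """2n{p(log p - log P) + q(log q - log Q)} for one observed proportion a/n."""
    total = 0.0
    for observed, expected in ((a, n * P), (n - a, n * (1 - P))):
        if observed > 0:                  # a zero count contributes nothing
            total += observed * math.log(observed / expected)
    return 2.0 * total

def chi2_contribution(a, n, P):
    """n(p - P)^2 / (PQ) for the same proportion."""
    return n * (a / n - P) ** 2 / (P * (1 - P))

lr = lr_contribution(3, 200, 0.0025)
x2 = chi2_contribution(3, 200, 0.0025)
```

In this hypothetical case the positive deviation from a very small expectation contributes much less to the logarithmic criterion than to $\chi^2$.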


ESTIMATION OF ERROR AND TESTS OF SIGNIFICANCE IN THE DROSOPHILA DATA 


The above formulae can now be applied to the Drosophila data. Taking the values of the
weights used for deriving the second estimates (Table 4) we have

$$V(t_1 - t_2) = 1/149.96 = 0.0817^2.$$

Hence $t_1 - t_2 = +0.2937 \pm 0.0817$. Reference to the normal probability integral gives
$P = 0.000163$. Hence the significance of the difference is clearly established. If the actual
mutation rates are of interest the standard errors of $t_1$, $t_2$ and $\tfrac{1}{2}(t_1 + t_2)$ will also be required.
These are given by

$$V(t_1) = \frac{1}{629.6} + \frac{295.8^2}{149.96 \times 629.6^2} = 0.0553^2,$$

$$V(t_2) = \frac{1}{629.6} + \frac{333.8^2}{149.96 \times 629.6^2} = 0.0588^2,$$

$$V\{\tfrac{1}{2}(t_1 + t_2)\} = \frac{1}{629.6} + \frac{(333.8 - 295.8)^2}{4 \times 149.96 \times 629.6^2} = 0.0399^2.$$
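The arithmetic of these variance expressions can be checked with a short script. This is only a numerical check of the printed expressions, not a derivation: the variable names are hypothetical, and the roles of the quantities 149.96, 629.6, 295.8 and 333.8 are taken from the formulae as printed rather than from the (earlier) normal equations.

```python
import math

W_DIFF = 149.96          # weight of t1 - t2, as printed
W_TOTAL = 629.6          # denominator appearing in all three variances
A1, A2 = 295.8, 333.8    # numerators appearing in V(t1) and V(t2)

se_diff = math.sqrt(1.0 / W_DIFF)
se_t1 = math.sqrt(1.0 / W_TOTAL + A1 ** 2 / (W_DIFF * W_TOTAL ** 2))
se_t2 = math.sqrt(1.0 / W_TOTAL + A2 ** 2 / (W_DIFF * W_TOTAL ** 2))
se_mean = math.sqrt(1.0 / W_TOTAL + (A2 - A1) ** 2 / (4.0 * W_DIFF * W_TOTAL ** 2))

# One-tail normal probability for the observed difference 0.2937.
z = 0.2937 / se_diff
p_one_tail = 0.5 * math.erfc(z / math.sqrt(2.0))

print(round(se_diff, 4), round(se_t1, 4), round(se_t2, 4), round(se_mean, 4))
print(p_one_tail)
```

The one-tail probability differs from the printed 0.000163 only in the last figure, which depends on how far the standard error is rounded before the ratio is formed.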


The test for variation in sensitivity between the experiments is equivalent to testing that
the true values of $m_r$ are zero. We have

$$S(W_{1r} + W_{2r})\,m_r^2 = 4.745,$$

which has to be compared with $\chi^2$ with 5 d.f. There is therefore no evidence of
heterogeneity between experiments. Taking the 7, into account increases the expectation of the
above expression by 0-0458. Clearly the 7, are of no consequence here. 

Using expression (10) and the provisional and working values of the second approximation
gives a value of $\chi^2$ of 5.15. Since the maximum-likelihood solution is very close to the
minimum $\chi^2$ solution, it is to be expected that this value of the residual $\chi^2$ will be slightly too
large, and in fact the provisional and working values of the third approximation (equivalent
to the second approximation estimates) give a value of 5.11. The value based on the
provisional and working values of the second approximation is clearly adequate for practical
purposes.

OTHER EXAMPLES 

As further examples, which bring out some additional points, we may take the three sets
of data given by E. S. Pearson (1950).


(a) Road research data 


Table 5 gives the number of cycles and motor-cycles involved in accidents causing
personal injury on sections (6 to 15 miles long) of five main roads near London, together
with estimates of the miles travelled by these two types of vehicle on these sections. A
comparison of the probabilities of being involved in an accident per vehicle mile is required.

Since a vehicle which is involved in an accident is either removed from the road or is
again exposed to risk of further accident the number of accidents will follow a Poisson
distribution. If the probability of accident per vehicle mile is $\theta$ the expected number of
accidents per $n$ vehicle miles will be $n\theta$. Following the procedure of the previous example,
we may denote the probability of accident for cycles and motor-cycles on road $r$ by $\mu_r\theta_1$
and $\mu_r\theta_2$. This will make allowances for differences between the different roads which affect
both probabilities proportionally.


Table 5. Road Research data

                      Cycles                           Motor-cycles
Road     Accidents   Miles/10^3      p        Accidents   Miles/10^3      p

1            5          1690      0.0030          4           800      0.0050
2            2          1800      0.0011          2           500      0.0040
3            3          1820      0.0016          1           330      0.0030
4            4          1790      0.0022          3           570      0.0053
5            3          1560      0.0019          5           540      0.0093

Total       17          8660      0.00196        15          2740      0.00547











The Poisson distribution with mean $m$ can be regarded as the limit of the binomial
distribution $(q + p)^n$, in which $p \to 0$ and $n \to \infty$ in such a manner that $np = m$. Consequently,
provided the unit of distance travelled is taken sufficiently small for the chance of two
accidents occurring in the same unit to be negligible, the data can be treated as if they were
binomial. Here a unit of 1000 vehicle miles is sufficiently small. As before the complementary
log log transformation is appropriate. Since

$$y = \log_e(-\log_e q) \to \log_e p$$

as $p \to 0$, a table of natural logarithms can be used for the actual transformation. Working
values and weighting coefficients can be obtained from Finney's Table IX, which has
adequate range for this purpose. Alternatively, if $x$ is the number of occurrences a table of
natural logarithms may be used with

$$W = nw = ne^{Y},$$

$$y = Y - 1 + x/W.$$

With this procedure there is no need to choose units such that $p = x/n$ is small.
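The two formulae $W = ne^{Y}$ and $y = Y - 1 + x/W$ translate directly into code. The following is a minimal sketch, with a hypothetical function name; the inputs are the figures for cycles on the first road (5 accidents in 1690 units of 1000 vehicle miles) with a provisional value of $-6.0$.

```python
import math

def poisson_working_value(x, n, Y):
    """Weight W = n e^Y and working value y = Y - 1 + x/W for provisional value Y."""
    W = n * math.exp(Y)
    y = Y - 1.0 + x / W
    return y, W

# Cycles on road 1: x = 5 accidents, n = 1690 units, provisional Y = -6.0.
y, W = poisson_working_value(5, 1690, -6.0)
print(round(y, 3), round(W, 2))

# At the direct estimate Y = ln(x/n) the working value reproduces Y and W equals x.
Y0 = math.log(5 / 1690)
y0, W0 = poisson_working_value(5, 1690, Y0)
print(round(y0, 2), round(W0, 2))
```

The second call illustrates why the first estimate can use $W$ equal to the observed count: when the provisional value is the observed log rate, the weight is exactly $x$ and the working value is unchanged.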

The computations follow the same lines as those of the previous example, and need not
be reproduced in detail here. For cycles on the first road, for instance,

$$n = 1690, \quad n' = 5, \quad p = 0.0030, \quad \log_e p = -5.81,$$

and we therefore have:

                    Y         y         W
1st estimate                -5.81       5
2nd estimate      -6.0      -5.79       4.2

the provisional value $Y = -6.0$ being given by the first estimate of $t_1 + m_1$. For the first
estimate $W = n'$, since $w = P = p$.













The values of the first and second estimates of the parameters are shown in Table 6. The
second set of estimates differs little from the first and is clearly sufficiently accurate.

The average difference is clearly significant, motor-cycles being estimated as about 2.6
times as liable as cycles to be involved in an accident per 1000 vehicle miles. Taking $\pm 1.65$
times the standard error, we obtain approximate 5% lower and upper limits of error of
this ratio as 1.5 and 4.7.
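Because the difference is estimated on a logarithmic scale, the ratio and its limits follow by exponentiating. A minimal sketch, using the second-estimate difference 0.9713 (taken in magnitude) and the standard error 0.356 quoted for it; the function name is hypothetical.

```python
import math

def ratio_with_limits(log_diff, se, multiplier=1.65):
    """Back-transform a log-scale difference and its limits to a ratio scale."""
    ratio = math.exp(log_diff)
    lower = math.exp(log_diff - multiplier * se)
    upper = math.exp(log_diff + multiplier * se)
    return ratio, lower, upper

ratio, lower, upper = ratio_with_limits(0.9713, 0.356)
print(round(ratio, 2), round(lower, 2), round(upper, 2))
```

The limits are not symmetrical about the ratio, as is to be expected after a logarithmic back-transformation.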

The values of the various $\chi^2$ are shown in Table 7. There is no evidence of variation in risk
between the different roads, or of any variation of the difference between cycles and
motor-cycles.
Table 6. Road Research data: 1st and 2nd estimates

                   1st         2nd                    1st         2nd

t_1             -6.1597     -6.1686       m_1      +0.1420     +0.1304
t_2             -5.1824     -5.1973       m_2      -0.4930     -0.4938
t_1 - t_2       -0.9773     -0.9713       m_3      -0.3671     -0.3633
Weight            7.56        7.…         m_4      -0.0020      0.0117
S.E. (t_1 - t_2) = ±0.356                 m_5      +0.2726     +0.2423
| | 
Table 7. Road Research data: values of $\chi^2$

                                        D.F.     $\chi^2$

Average difference (t_1 - t_2)            1       7.41
Variation between roads (m's)             4       2.12
Variation in difference (residual)        4       1.40*

Total                                     9      10.93

* From $S\,W_2(Y_2 - y_2)^2$. $S\,W_3(Y_3 - y_3)^2$ gives 1.38.


That there is no appreciable variation between roads can in fact be seen by inspection of
the values of $p_1$ and $p_2$ in the original data; such variation would result in a positive
correlation between the values of $p_1$ and $p_2$. This fact suggests a short cut to the whole analysis,
since if there is no variation between roads or in the difference between cycles and
motor-cycles the data for each type of vehicle may be pooled. A combined test for absence of such
variation may be quickly made from the $\chi^2$ for the variation of the ratios for each vehicle
separately. These $\chi^2$ can be calculated in the ordinary manner and are found to be:

                   D.F.     $\chi^2$

Cycles               4       1.67
Motor-cycles         4       2.01

Total                8       3.68
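One way of calculating these $\chi^2$ "in the ordinary manner" is to compare each road's accidents with the number expected from the pooled rate. The sketch below is an assumed reading of that calculation, not a transcription of the author's working. The counts and mileages are those of Table 5; the cycle mileage for road 5 is taken as 1560, the value consistent with the printed total of 8660 and the printed proportion 0.0019 (this is an assumption about the table).

```python
def homogeneity_chi2(counts, exposures):
    """Sum of (observed - expected)^2 / expected, expected from the pooled rate."""
    pooled_rate = sum(counts) / sum(exposures)
    return sum((x - n * pooled_rate) ** 2 / (n * pooled_rate)
               for x, n in zip(counts, exposures))

# Accidents and miles/1000 for the five roads (road 5 cycles assumed to be 1560).
cycles = homogeneity_chi2([5, 2, 3, 4, 3], [1690, 1800, 1820, 1790, 1560])
motor_cycles = homogeneity_chi2([4, 2, 1, 3, 5], [800, 500, 330, 570, 540])

print(round(cycles, 2), round(motor_cycles, 2), round(cycles + motor_cycles, 2))
```

Each value has 4 degrees of freedom (five roads, one fitted rate), and the results agree with the tabulated values to within a unit in the second decimal place.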

























Using the pooled data we may estimate immediately that cycles and motor-cycles have
risks of 0.00196 and 0.00548 per 1000 vehicle miles of being involved in an accident. These
agree closely with the estimates of 0.00209 and 0.00553 given by the previous analysis.

The possibility of pooling fragmentary data should always be examined. If pooling is 
justified the numerical work is considerably reduced and the presentation of the data is 
simplified. In addition, the amount of information is increased when heterogeneity between 
the different sets of data can be ignored. 


(b) Artificial insemination data 


These data (Table 8) are the result of an investigation (Rothschild, 1949) to test whether 
the passage of a small electric current for the measurement of the activity of the spermatozoa 
had any adverse effect on the fertility of bull semen. Seven samples were each divided into 
two parts, one being subjected to the electric current. The half-samples were then diluted 
and used to inseminate heifers. The columns headed ‘returns’ give the numbers of heifers 
which did not become pregnant, as judged by their being returned by their owners for 
reinsemination. 


Table 8. Artificial insemination data

                 Untreated semen                  Treated semen
Sample     Returns    Total      p          Returns    Total      p

1             …         …      0.455           …         …        …
2             …         …      0.333           …         …      0.444
3             …         …      0.25            …         …        …
4             …         …      0.545           …         …        …
5             …         …      0.4             …         …        …
6             …         …      0.273           1         5      0.2
7             …         …      0.25            …         …        …

Total        25        68      0.368          15        48      0.312





Inspection of the individual $p$ values gives no indication of any heterogeneity between
samples. The uniformity of the data is confirmed by the $\chi^2$ between samples for treated and
untreated semen separately. The values are:

                D.F.     $\chi^2$

Untreated         6       3.32
Treated           6       7.92

Total            12      11.24




























The data may therefore be safely pooled, giving a value of $\chi^2$ (uncorrected) of 0.378, and
corrected for continuity of 0.174, corresponding to probability levels (one tail) of 0.27
and 0.34.
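The pooled values can be reproduced from the totals of Table 8 (25 returns out of 68 untreated, 15 out of 48 treated) with the usual 2 × 2 formula. The sketch below uses hypothetical function names; the one-tail level is taken as the upper tail of a unit normal deviate equal to the square root of $\chi^2$, which assumes the deviation lies in the direction being tested.

```python
import math

def chi2_2x2(a, b, c, d, continuity=False):
    """Chi-square for the 2 x 2 table [[a, b], [c, d]], optionally with continuity correction."""
    n = a + b + c + d
    dev = abs(a * d - b * c)
    if continuity:
        dev = max(dev - n / 2.0, 0.0)
    return n * dev ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def one_tail_level(chi2):
    """Upper-tail normal probability of sqrt(chi2)."""
    return 0.5 * math.erfc(math.sqrt(chi2) / math.sqrt(2.0))

# Rows: untreated, treated; columns: returns, non-returns.
uncorrected = chi2_2x2(25, 43, 15, 33)
corrected = chi2_2x2(25, 43, 15, 33, continuity=True)
print(round(uncorrected, 3), round(one_tail_level(uncorrected), 2))
print(round(corrected, 3), round(one_tail_level(corrected), 2))
```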

The former probability may be compared with the result obtained by Rothschild, who
analysed the data by means of the angular transformation, obtaining a probability (one tail)
of 0.11. The poor agreement is not (as might be supposed) primarily due to the effect of
pooling, but to the fact that Rothschild analysed directly transformed values without any
final adjustment of the type that is automatically introduced by the use of provisional and
working values. If a common provisional value is taken for the treated samples, and a second
common provisional value for the untreated samples, an estimate of treated-untreated
(in terms of angles) of $-4.41 \pm 5.50$ is obtained, giving a $P$ (one tail) of 0.21. Similar
treatment by the probit transformation also gives a $P$ (one tail) of 0.21. This illustrates the
importance of including these final adjustments not only in cases of 'special accuracy' as was
implied in the earlier editions of Statistical Tables (Fisher & Yates), but in all cases in which
their use materially affects the results.

Although there is no evidence of any harmful effect of the electric current, the treated
semen being apparently the more fertile, it should be noted that the data are not sufficiently
extensive to exclude all possibility of a harmful effect of some practical importance. Use of
the logit transformation on the pooled data gives an upper 5% limit to treated-untreated
of $+0.23$, which is equivalent to a difference in fertility, $(1 - p)$, of 0.10 in favour of the
untreated semen.


(c) Safety in mines data 


The data were obtained in course of an investigation at the Safety in Mines Research and 
Testing Branch of the Ministry of Fuel and Power, concerning the relative safety of mining 
explosives. Table 9 shows the result of six separate experiments comparing two explosives, 
A and B, under a number of different conditions. The cartridges of explosive were fixed in 
a gas chamber and a record was made of whether each shot resulted in an ignition or non- 
ignition. The weight of cartridge and the conditions of the explosion were varied in the 
different comparisons, but the details of these variations are not included in Pearson’s 











Table 9. Safety in Mines data

                      Explosive A                          Explosive B
Comparison    Ignitions   Total shots     p        Ignitions   Total shots     p

1                 8           10         0.8           5            5         1.0
2                 2           10         0.2          13           15         0.867
3                 4            5         0.8           5            5         1.0
4                 1            5         0.2           1           10         0.1
5                 1            5         0.2           0           10         0
6                 0           10         0             4            6         0.667











Inspection of Table 9 indicates that there is considerable heterogeneity between
experiments, with Exps. 4 and 5 at one extreme and Exps. 1 and 3 at the other. The ratio of $n_{1r}$
to $n_{2r}$ also varies considerably from experiment to experiment. Pooling would therefore
be misleading. As in the case of the insemination data the logit transformation is most
appropriate. The procedure of analysis already given in detail for the Drosophila data can
be followed, using the table for the logit transformation. There is an initial difficulty in
obtaining first estimates, however, since some values of $p$ are 0 or 1. In these cases the
transformed values are $\pm\infty$ with zero weight. These were replaced by the following
(arbitrary) values: 3.0 for Exps. 5(B) and 6(A), in both of which 0/10 were observed, and 6.8
for 1(B) and 3(B), in both of which 5/5 were observed.


Table 10. Safety in Mines data: successive estimates of $t_1 - t_2$

Estimate      t_1 - t_2      Weight

   1           -0.9183         9.8
   2           -1.1502        10.4
   3           -1.2125         8.29
   4           -1.2129         7.64


























With these starting values the sets of estimates shown in Table 10 were obtained. As will 
be seen convergence is relatively slow and three sets of estimates are required to attain 
satisfactory accuracy, with a fourth set to confirm that the approximation has been taken 
far enough. The slow convergence is not primarily due to a bad choice of starting values for 
the cases where p = 0 or 1. The main reason is the anomalous behaviour of Exps. 4 and 5. 
This results in very low final provisional values for explosive A in these two experiments, 
with correspondingly large working values and small weights. The values for Exp. 5, the 
more extreme of the two, in the successive approximations are as follows: 








                                          Contribution
  s        Y_s        y_s        W_s      to ($\chi^2$)_s

  1                   4.31       3.2          2.6
  2        3.5        5.19       0.9         11.8
  3        2.9        9.26       0.29        18.6
  4        2.7       12.34       0.20        19.8
  5        2.67      13.04       0.185
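The working values and weights in this table can be reproduced numerically under one assumption, which is inferred from the printed figures and not stated in this section: that the tabulated logit is $5 + \tfrac{1}{2}\log_e(P/Q)$, with weight $4nPQ$ and working value $Y + (p - P)/(2PQ)$. The function name is hypothetical. The inputs are those of explosive A in Exp. 5 (1 ignition in 5 shots, $p = 0.2$).

```python
import math

def logit_working_value(Y, p, n):
    """Working value and weight for a provisional logit Y = 5 + 0.5*ln(P/Q) (assumed scale)."""
    z = Y - 5.0
    P = 1.0 / (1.0 + math.exp(-2.0 * z))   # expected proportion implied by Y
    Q = 1.0 - P
    y = Y + (p - P) / (2.0 * P * Q)        # working value
    W = 4.0 * n * P * Q                    # weight
    return y, W

for Y in (3.5, 2.9, 2.7):
    y, W = logit_working_value(Y, 0.2, 5)
    print(Y, round(y, 2), round(W, 2))
```

As the provisional value falls, the working value rises rapidly while the weight shrinks, which is the behaviour described for this experiment.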























The change of $y$ from $y_3$ to $y_4$, although apparently large, makes little difference to the
estimates, largely owing to the corresponding decrease in the already small weight. The
contribution to $\chi^2$ also reaches relative stability when $s = 3$, the change with $s = 4$ being
mainly due to taking $Y_4$ to only one decimal place. The values of $\chi^2$ are as follows:











                                          D.F.     $\chi^2$

Average difference (t_1 - t_2)              1       11.3
Variation between experiments (m's)         5       33.8
Variation in difference (residual)          5       31.8

Total                                      11       76.9




























The residual $\chi^2$ is, however, unreliable in this set of data owing to the small expectations
of the non-zero classes of explosive A (ignitions) in Exps. 4 and 5. The likelihood criterion
$2S\{a \log_e (a/m)\}$ has the value of 13.41, giving a significance level $P = 0.020$. This may be
compared with the significance level obtained by combining the data into the two groups,
namely, Exps. 4 and 5 and Exps. 1, 2, 3 and 6. (This particular combination is valid for the
purpose of a test of significance, since 4 and 5 are the two experiments with low values of
$n'/n$.) A 2 × 2 × 2 table then results. For the interaction of such a table, as Bartlett (1935)
has shown, the exact probabilities of the various possible occurrences with all marginal
totals fixed can be calculated in a similar manner to that followed in a 2 × 2 table. The
resultant test is in fact a test for the constancy of the differences of the logits.* Bartlett's
test gives a probability of 0.0147 of two or more of the three ignitions in Exps. 4 and 5 being
associated with explosive A. Bearing in mind (a) that practically the whole of the
discrepancy is in this test concentrated in a single degree of freedom, but that, on the other hand
(b) there is a large effect of discontinuity, not present in the likelihood criterion, the
agreement between the two tests is satisfactory.

The indication of the residual $\chi^2$ that these experiments are anomalous is therefore
confirmed, though the significance level is much less than that given by the $\chi^2$ value. This
result is not, however, of great consequence, since it merely provides evidence of the
incorrectness of the supposition that the difference in logits between the two explosives is
constant over the whole of the probability scale. Our conclusions from these data should
be that explosive B produces significantly fewer ignitions than explosive A when conditions
are such that the probability of ignition is substantial, but that there is no evidence of any
difference when the probability is low. Substantially more data at low probabilities would
be required, however, to reach any firm conclusions as to what is really happening in this
region.

SUMMARY 


The analysis of sets of quantal experiments with two treatments by the use of transforma- 
tions and maximum likelihood is discussed, and illustrated by examples, one of which is 
also treated by large-sample methods. Methods are given for (a) obtaining an estimate of 
the treatment difference in terms of the transformed variate, together with its standard 
error, (b) testing for differences in sensitivity between the different experiments, and (c)
testing for variation in the treatment difference from experiment to experiment. The 
appropriateness of various transformations under different circumstances is also considered. 


* Simpson (1951), following a suggestion by Bartlett, has drawn attention to certain errors in the
latter's paper, which do not, however, affect the test for interaction. Lancaster (1951) has criticized
Bartlett's test on what seems to me the invalid ground that the components of $\chi^2$ associated with this
test are not additive.


REFERENCES

Bartlett, M. S. (1935). J. R. Statist. Soc. Suppl. 2, 248.
Bonnier, G. & Lüning, K. G. (1953). Hereditas, Lund., 39, 193.
Cochran, W. G. (1942). Iowa State Coll. J. Sci. 16, 421.
Cochran, W. G. (1952). Ann. Math. Statist. 23, 315.
Cochran, W. G. (1954). Biometrics, 10, 417.
Dyke, G. V. & Patterson, H. D. (1952). Biometrics, 8, 1.
Finney, D. J. (1952). Statistical Method in Biological Assay. London: Griffin.
Fisher, R. A. (1932). Statistical Methods for Research Workers (4th ed.). Edinburgh: Oliver and Boyd.
Fisher, R. A. (1950). Biometrics, 6, 17.
Fisher, R. A. & Yates, F. (1938). Statistical Tables for Biological, Agricultural and Medical Research
(4th ed. 1953). Edinburgh: Oliver and Boyd.
Jolly, G. M. (1950). Ann. Appl. Biol. 37, 597.
Lancaster, H. O. (1949). Biometrika, 36, 370.
Lancaster, H. O. (1951). J. R. Statist. Soc. B, 13, 242.
Pearson, E. S. (1950). Biometrika, 37, 383.
Pearson, K. (1933). Biometrika, 25, 379.
Rothschild, Lord (1949). J. Agric. Sci. 39, 294.
Simpson, E. H. (1951). J. R. Statist. Soc. B, 13, 238.
Yates, F. (1953). Sampling Methods for Censuses and Surveys (2nd ed.). London: Griffin.
Yates, F. (1955). Biometrika, 42, 404.
Yates, F. & Hale, R. W. (1939). J. R. Statist. Soc. Suppl. 6, 67.
















A NOTE ON THE APPLICATION OF THE COMBINATION OF 
PROBABILITIES TEST TO A SET OF 2x2 TABLES 


By F. YATES, F.R.S. 
Rothamsted Experimental Station 


INTRODUCTION 


The combination of probabilities test, or probability integral test, as it is also called, has 
been commonly used to give an overall test of significance for a set of 2 x 2 contingency 
tables for which pooling is not regarded as admissible because of heterogeneity of the data. 
Data of this type can, in my opinion, be more effectively analysed by the method of maxi- 
mum likelihood, using some appropriate transformation; the methods of analysis are 
described in a separate paper (Yates, 1955). Cases will, however, arise where a quick, 
possibly preliminary, test of significance is required, and for this purpose the combination 
of probabilities test or some analogous test may be regarded as adequate. It therefore seems 
worth putting on record certain points concerning tests of this type. 

The combination of probabilities test requires the calculation of a significance level $P$
for each of the $k$ tables separately. The test criterion $S(-2\log_e P)$ is then compared with the
$\chi^2$ distribution with $2k$ degrees of freedom. We usually require to test whether the data
collectively show evidence of deviation in a common direction, and in such cases the
significance level $P$ must be that of the appropriate single tail of the corresponding distribution.
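The test criterion can be sketched in a few lines. Since the number of degrees of freedom is always even, the upper tail of $\chi^2$ has a closed form and no statistical library is needed. The function names are hypothetical, and the three one-tail probabilities used in the illustration are invented for the purpose; they are not data from the paper.

```python
import math

def chi2_upper_tail_even(x, df):
    """Upper-tail probability of chi-square for an even number of degrees of freedom."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    # exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def combine_probabilities(one_tail_levels):
    """Return S(-2 ln P) and its significance level on 2k degrees of freedom."""
    statistic = sum(-2.0 * math.log(p) for p in one_tail_levels)
    return statistic, chi2_upper_tail_even(statistic, 2 * len(one_tail_levels))

# Hypothetical one-tail levels from three tables.
statistic, level = combine_probabilities([0.10, 0.20, 0.05])
print(round(statistic, 3), round(level, 4))
```

With a single table the combined level reduces to the original $P$, which is a convenient check on any implementation.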

The test is exact if the basic $P$ distributions are continuous with uniform probability,
i.e. if each $P$ can be regarded as uniformly distributed between 0 and 1 when the hypothesis
to be tested holds. If the $P$'s are derived from 2 × 2 contingency tables with small marginal
totals, however, the distributions of $P$ are by no means continuous. As has been pointed out
by Pearson (1950) they can be made continuous with uniform probability by the following
device. If for any particular 2 × 2 table the probability derived from the hypergeometric
series is $P_a(H)$ that the number in a particular cell is $\geq a$, and the probability of this number
being $\geq (a+1)$ is $P_{a+1}(H)$, then if $a$ is the number observed we may use a value $P_{a,\theta}(H)$
instead of $P_a(H)$, where $P_{a,\theta}(H)$ is a random selection with uniform probability from
values in the range $P_a(H)$ to $P_{a+1}(H)$.

Since all the values of $P$ will be decreased by applying this adjustment, the value of
$S(-2\log_e P)$ will be increased. Without the adjustment, therefore, we shall consistently
over-estimate the combined probability of obtaining a set of $P$'s whose product is as small
as or smaller than that observed. This confirms the result obtained by Lancaster (1949) in
the course of a detailed investigation on the combination of probabilities from discrete
distributions.

To overcome this difficulty Lancaster suggested that the average value of $-2\log_e P_{a,\theta}(H)$
be used instead of the value $-2\log_e P_a(H)$. This average value he denoted by $\chi^2_m$, but to avoid
confusion with the various $\chi^2$ with 1 d.f. appertaining to the individual experiments, we
will here denote it by $-2\log_e P_{a'}(H)$. We have

$$-2\log_e P_{a'}(H) = \chi^2_m = E\{-2\log_e P_{a,\theta}(H)\}$$

$$= 2 - 2\{P_a(H)\log_e P_a(H) - P_{a+1}(H)\log_e P_{a+1}(H)\}/\{P_a(H) - P_{a+1}(H)\}.$$
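The mean value formula can be evaluated for every possible $a$ once the point probabilities of the distribution are known. The sketch below uses a hypothetical function name and takes as input the point probabilities $p(a)$ listed for distribution A in Table 1 of this note. It also checks the property that makes the device useful: the expectation of the mean value over the distribution is 2, as for a true $\chi^2$ with 2 d.f.

```python
import math

def lancaster_mean_values(pmf):
    """Mean value of -2 ln P for each a, where P_a = Prob(X >= a)."""
    n = len(pmf)
    tail = [0.0] * (n + 1)           # tail[a] = P_a(H); tail[n] = 0
    for a in range(n - 1, -1, -1):
        tail[a] = tail[a + 1] + pmf[a]

    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0   # limit of x ln x at 0 is 0

    return [2.0 - 2.0 * (xlogx(tail[a]) - xlogx(tail[a + 1])) / (tail[a] - tail[a + 1])
            for a in range(n)]

# Point probabilities p(a), a = 0..8, for distribution A as listed in Table 1.
pmf_a = [0.003377, 0.029052, 0.107092, 0.220948, 0.279062,
         0.220948, 0.107092, 0.029052, 0.003377]

values = lancaster_mean_values(pmf_a)
expectation = sum(p * v for p, v in zip(pmf_a, values))
print([round(v, 3) for v in values])
print(round(expectation, 3))
```

The expectation comes out as 2 because the numerators telescope when weighted by the point probabilities $P_a(H) - P_{a+1}(H)$.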





As a further alternative, to save computation, Lancaster suggested what he called the median
value $\chi'^2_m$, which he defined by

$$\chi'^2_m = -2\log_e \tfrac{1}{2}\{P_a(H) + P_{a+1}(H)\}, \quad P_{a+1}(H) \neq 0,$$

$$\chi'^2_m = 2 - 2\log_e P_a(H), \quad P_{a+1}(H) = 0.$$

Neither of Lancaster's procedures, of course, gives an exact $\chi^2$ distribution with $2k$ d.f.,
but the departures will be very slight in all cases likely to be met with in practice. By
Pearson's procedure an exact $\chi^2$ distribution can be generated, but this introduces an
arbitrary element which, as Pearson himself recognized, may be regarded as altogether too
high a price to pay for a formally exact test.


USE OF $\chi^2$ TO DETERMINE THE VALUES OF $P$


If the number of terms in the hypergeometric series appertaining to a single experiment is
at all large the calculation of the exact probabilities is a tedious matter. An alternative
procedure is to calculate the values of $\chi^2$ for the individual experiments without the
correction for continuity, and use these to determine the corresponding one-tail probabilities.
Cochran (1942, 1952), correcting a misleading statement by myself (1934), pointed out that
the continuity correction should not be applied when summing $\chi^2$ from a number of 2 × 2
tables, though the reason he gives in his 1952 paper, that 'the correction has a tendency to
over-correct' and that 'the over-correction mounts up in a disconcerting manner' is not,
as we have seen above, the basic one. Lancaster discussed the use of $\chi^2$ without correction for
continuity for the combination of probabilities test, but confined his attention to the
calculation of the two-tail probabilities. Unfortunately, as already mentioned, this is not what is
usually required in practice, and Lancaster's argument is not applicable to the one-tail case.

The question of the accuracy of the $\chi^2$ approximation, therefore, still requires
investigation. When $\chi^2$ corrected for continuity, $\chi^2_c$ say, gives a good approximation to the one-tail
probabilities, it may be expected that the use of $\chi^2$ without correction for continuity, $\chi^2_u$ say,
for the calculation of the probabilities for combination will give a reasonable approximation
to the test based on the exact probabilities and $-2\log_e P_{a'}(H)$ as defined above. For if
$P_a(\chi_u)$ is the one-tail probability given by $\chi^2_u$ when the value $a$ is observed, and $P_a(\chi_c)$ is
the corresponding probability given by $\chi^2_c$, then $P_a(\chi_u)$ is intermediate between $P_a(\chi_c)$ and
$P_{a+1}(\chi_c)$, and will not differ greatly from $P_{a'}(\chi_c)$.

It is well known, however, that if the relevant hypergeometric distribution is very
asymmetrical $P_a(\chi_c)$ is not a good approximation to $P_a(H)$ (Yates, 1934). Consequently, if the
data are such that the hypergeometric distributions for the separate tables are very
asymmetrical, and particularly if the asymmetry is in the same sense for all tables, the use
of $\chi^2$ without correction for continuity instead of the exact probabilities may be expected
to give misleading results.

Some comparisons made for actual 2 x 2 tables, however, are reassuring. Tables having 
the following marginal totals (treatments 1 and 2) were taken: 


Table A Table B Table C 

l 2 1 2 1 2 
+ a 8 + a 8 + a 16 
a | 192 = 142 “ 224 






















Table A gives a symmetrical distribution and Tables B and C give increasing degrees of
asymmetry. (No particular merit is claimed for these tables; A was originally chosen for
another purpose, and B and C were taken to give distributions differing from A mainly in
degree of symmetry.)

For these three tables the values of $-2\log_e P_{a'}(H)$ and $-2\log_e P_a(\chi_u)$ were calculated
for all possible values of $a$. For Tables B and C there are two sets of values depending on the
direction in which the deviations are measured. The results are shown in Table 1. In
distribution A all the individual values except that for $a = 8$ (which will only occur in 1 out

Table 1. Comparison of $-2\log_e P_{a'}(H)$ and $-2\log_e P_a(\chi_u)$

Distribution A

                           Test for 1 > 2
  a       p(a)       -2 log_e P_a'(H)    -2 log_e P_a(chi_u)

  0     0.003377          0.004               0.004
  1     0.029052          0.036               0.030
  2     0.107092          0.181               0.155
  3     0.220948          0.583               0.536
  4     0.279062          1.413               1.386
  5     0.220948          2.842               2.894
  6     0.107092          5.055               5.195
  7     0.029052          8.332               8.382
  8     0.003377         13.383              12.484

        1.000000          2.000               2.004

Distribution B

                           Test for 1 > 2                            Test for 1 < 2
  a       p(a)       -2 log_e P_a'(H)    -2 log_e P_a(chi_u)    -2 log_e P_(8-a)'(H)    -2 log_e P_(8-a)(chi_u)

  0     0.035397          0.036               0.040                  8.682                   7.833
  1     0.152244          0.239               0.210                  4.571                   4.616
  2     0.277764          0.805               0.724                  2.302                   2.384
  3     0.280688          1.905               1.839                  1.021                   1.017
  4     0.171775          3.662               3.767                  0.371                   0.330
  5     0.065168          6.178               6.646                  0.102                   0.073
  6     0.014962          9.581              10.56                   0.019                   0.010
  7     0.001900         14.11               15.56                   0.002                   0.001
  8     0.000102         20.38               21.67                   0.000                   0.000

        1.000000          2.000               2.021                  2.000                   1.989

Distribution C

                           Test for 1 > 2                            Test for 1 < 2
  a       p(a)       -2 log_e P_a'(H)    -2 log_e P_a(chi_u)    -2 log_e P_(16-a)'(H)   -2 log_e P_(16-a)(chi_u)

  0     0.048696          0.049               0.065                  8.044                   6.881
  1     0.168463          0.288               0.264                  4.190                   4.182
  2     0.264922          0.874               0.776                  2.152                   2.268
  3     0.251227          1.907               1.791                  1.011                   1.050
  4     0.160692          3.426               3.460                  0.416                   0.390
  5     0.073459          5.442               5.890                  0.144                   0.108
  6     0.024809          7.955               9.148                  0.041                   0.021
  7     0.006309         10.96               13.27                   0.009                   0.002
  8     0.001220         14.46               18.29                   0.002                   0.000
  9     0.000180         18.47               24.23                   etc.                    etc.
 10     0.000020          etc.                etc.
 11     0.000002

        0.999999          2.000               2.030                  2.000                   1.975














of 300 experiments) agree to within 0.15, and the expectation of $-2\log_e P_a(\chi_u)$ is almost
exactly 2. Distributions B and C show considerably greater individual differences, as is to
be expected, but the large differences are again confined to the tails, where they are of no
great consequence; for if values occur in the tail in any substantial proportion of the
experiments the significance of the combined results will not in any case be in doubt. The
expectations deviate in opposite directions for the two types of test but are still very close to 2.

On this last point it may be noted that Lancaster defines (without comment) what he
calls the crude $\chi^2$ of a 2 × 2 table as

$$\chi^2 = \frac{(N-1)(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}.$$

The customary form, which has been used in this paper for $\chi^2_u$, has the factor $N$ instead of
$N - 1$ in the numerator. Lancaster's form has the property, which is useful for the
combination of two-tail tests, that the expectation is 1. This, of course, does not make the
expectation of $-2\log_e P_a(\chi)$ equal to 2, but it does in fact bring it closer to 2 in the case of
symmetrical distributions. It can do little, however, to improve the expectations for
asymmetrical distributions, since the deviations are of opposite sign for the two types of
test.

From the above comparisons we may conclude that the determination of the one-tail
probabilities from $\chi^2$ without correction for continuity will be satisfactory in cases likely
to be met with in practice, even when the expectations in the individual experiments are
quite small.




VARIANTS OF THE TEST 


The use of values of $\chi^2$ for 2 d.f. for the combination of probabilities is to a certain extent
arbitrary. It has the convenience that the values are easily calculated, and the use of a
function of the product of the probabilities has a certain intuitive appeal, but the method
would work equally well with other basic numbers of degrees of freedom. If, for instance,
the values of $\chi^2$ for 1 d.f. corresponding to the $P$'s are summed then in the absence of
association the sum will be distributed as $\chi^2$ for $k$ d.f. Equally (and in the type of data we are
considering this procedure requires even less computation) the values of $\chi_u$ in the absence
of association approximate to normal deviates with zero mean and unit standard deviation,
and their sum is therefore a normal deviate with a standard deviation of $\sqrt{k}$ (see, for example,
Cochran, 1954).

The calculations for these variants for the Drosophila data analysed in the paper referred
to (Yates, 1955) are exhibited in Table 2. The combination of probabilities test (column 4)
gives a significance level (one tail) of $P = 0.000610$, whereas the use of $\chi^2$ for 1 d.f. instead
of 2 d.f. (column 5) gives $P = 0.000760$, and $S(\chi_u)$ (column 2) gives $P = 0.000673$.
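The $S(\chi_u)$ variant needs only a sum and a normal tail area. The sketch below applies it to the six signed values of $\chi_u$ listed in column 2 of Table 2; the function names are hypothetical.

```python
import math

def upper_tail_normal(z):
    """Upper-tail probability of a unit normal deviate."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def sum_of_deviates_test(deviates):
    """Refer S(chi_u) / sqrt(k) to the normal distribution (one tail)."""
    k = len(deviates)
    z = sum(deviates) / math.sqrt(k)
    return z, upper_tail_normal(z)

# Signed chi_u for the six Drosophila experiments (Table 2, column 2).
chi_u = [0.667, 2.056, 2.830, -0.204, 0.306, 2.198]

z, level = sum_of_deviates_test(chi_u)
print(round(sum(chi_u), 3), round(z, 3), round(level, 6))
```

The signs matter: an experiment deviating in the opposite direction (here the fourth) reduces the sum, which is how the direction of the deviations is taken into account.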


Table 2. Calculation of significance levels for the Drosophila data

Exp.      chi_u^2      chi_u      P_a(chi_u)    -2 log_e P_a(chi_u)    chi^2 (1 d.f.) for P_a(chi_u)
            (1)         (2)          (3)               (4)                       (5)

 1         0.4448      0.667       0.2524             2.753                     1.309
 2         4.2274      2.056       0.01989            7.835                     5.420
 3         8.0065      2.830       0.002327          12.126                     9.272
 4         0.0418     -0.204       0.5808             1.087                     0.305
 5         0.0938      0.306       0.3798             1.936                     0.771
 6         4.8325      2.198       0.01397            8.542                     6.042

Total     17.6468      7.853                         34.279                    23.119
P          0.00718     0.000673                       0.000610                  0.000760











These values are in close agreement. Indeed, except when the data from the different 
experiments are mutually contradictory no great differences are to be expected between the 
different possible methods that suggest themselves, since they merely demarcate equi- 
probable contour surfaces in the multiple P space which are not markedly different. Never- 
theless, alternative tests which are merely variants based on the same information are 
likely to give rise to confusion, and it is therefore recommended that some standard pro- 
cedure is adopted whenever a test based on the levels of significance in the individual 
experiments or sets of observations is required. If the information presented to the statis- 
tician consists of the relevant probabilities the combination of probabilities test commends 
itself both on historical grounds and because of its simplicity and intuitive appeal. If a 
one-tail test of a set of 2 × 2 tables based on the $\chi^2$ values for each table is required, however,
there seems little point in transforming the values of $\chi_u$ into probabilities and then
retransforming these into $\chi^2$ values. It is therefore recommended that in such cases $S(\chi_u)$
be adopted as the standard test criterion.



















In contrast with the above values of $P$ it may be noted that $S(\chi^2_u)$ (column 1), which in the
absence of association would also be approximately distributed as $\chi^2$ with 6 d.f., gives a
value of $P = 0.00718$. This considerably larger value of $P$ is to be expected, since the
direction of the deviations is not taken into account.


EFFICIENCY OF THE TEST 


The significance levels given by the combination of probabilities test and by the ratio of 
t, —t, to its standard error in the maximum-likelihood solution for the four examples of the 
other paper (Yates, 1955) are as follows: 





                                  Combination of     Maximum
                                  probabilities      likelihood
Drosophila data                   0.000610           0.000163
Road research data                0.00691*           0.00317
Artificial insemination data      0.64*              0.79
Safety in mines data              0.000517*          0.000402





* Computed from the values of S(x2,) given by Pearson (1950). 


In all three cases in which significance is attained the maximum-likelihood solution gives 
a higher level of significance than the combination of probabilities test. In two cases the 
difference is substantial. This suggests that the combination of probabilities test is not very 
efficient, though the differences may in part be due to chance causes or to inadmissible 
approximations in one or both the tests. 

Lancaster compared the power of the combination of probabilities test, the direct summation
of $\chi^2$ test, and the test derived from the pooled data by numerical methods, taking
the case of the binomial distribution (¢ + p)> with p = 0-5 as the null hypothesis and enumer- 
ating all possible events. He made a similar comparison of the first two tests and some 
variants on a 2 × 2 table. Unfortunately, however, he used two-sided tests. This considerably
reduces the efficiency of the combination of probabilities or summation of $\chi^2$ tests,
and his estimates of the relative power of the three tests are therefore not really relevant to
the situations met with in practice.

A full investigation of the efficiency of the combination of probabilities test is beyond the 
scope of this note, but it is to be expected on general grounds that the combination of 
probabilities test will be somewhat inefficient. In the first place no direct cognizance is 
taken of the number of observations and other factors that affect the amount of information 
given by the different experiments. It is true that the more accurate experiments will, on 
the average, yield results of greater significance when there is a real difference. But if, for 
example, of two experiments A and B, A yields 10 times the amount of information given 
by B, we should naturally be inclined to give more weight to a significant result from
experiment A than to one from experiment B. But on the combination of probabilities test the
result $P_A = 0.025$, $P_B = 0.3$, for instance, will give exactly the same combined level of
significance (5% single tail) as $P_A = 0.3$, $P_B = 0.025$.
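
The symmetry just noted can be checked numerically. The following is a minimal sketch, assuming the combination of probabilities test refers $-2\sum\log_e P$ to $\chi^2$ with 2k d.f. (as in column 4 of Table 2); the function name `combined_p` is illustrative only and does not come from the paper.

```python
import math

def combined_p(ps):
    """Combined one-tail level for k independent probabilities.

    Refers x = -2 * sum(log P) to chi-square with 2k d.f. For even d.f.
    the upper tail has the closed form exp(-x/2) * sum_{j<k} (x/2)^j / j!.
    """
    k = len(ps)
    half_x = -sum(math.log(p) for p in ps)  # this is x/2
    tail_sum = sum(half_x ** j / math.factorial(j) for j in range(k))
    return math.exp(-half_x) * tail_sum

# The order of the experiments is irrelevant to the combined level.
p_ab = combined_p([0.025, 0.3])
p_ba = combined_p([0.3, 0.025])

# Column 3 of Table 2 (Drosophila data), combined over the six experiments.
p_drosophila = combined_p([0.2524, 0.01989, 0.002327, 0.5808, 0.3798, 0.01397])
```

Under these assumptions `p_ab` and `p_ba` coincide at a little under 0.05, and `p_drosophila` agrees with the value 0.000610 quoted for the Drosophila data.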

In the second place even if all the experiments yield the same amount of information this 
information is not in general combined efficiently by the combination of probabilities test. 













410 Combination of probabilities test to a set of 2 x 2 tables 


As a simple example we may consider the case of a set of quantitative experiments which
furnish normally distributed estimates $z_1, z_2, \ldots, z_k$ of a constant treatment
difference $\mu$, all having the same (known) variance $\sigma^2$. The efficient (indeed
sufficient) combined estimate of $\mu$ will then be $\bar{z}$, and this will be normally
distributed with known variance $\sigma^2/k$ and can therefore be used to give an exact
overall P, $P_l$ say. The exact level of significance can also be calculated for each
experiment separately, and these levels can then be combined to give an overall P, $P_c$ say.
It is easily seen that in particular cases $P_c$ may differ considerably from $P_l$.


Table 3. Comparison of significance levels given by maximum likelihood ($P_l$)
and combination of probabilities ($P_c$) in a sampling experiment

$P(\bar{z} - 0.9)$    $P_l$         $P_c$         $\log_{10}(P_c/P_l)$
0.9987                0.839         0.820         -0.010
0.729                 0.0808        0.0955         0.073
0.719                 0.0764        0.0460        -0.220
0.702                 0.0694        0.0908         0.117
0.663                 0.0559        0.0409        -0.136
0.659                 0.0548        0.0296        -0.267
0.641                 0.0495        0.0512         0.015
0.618                 0.0436        0.0838         0.284
0.540                 0.0281        0.0145        -0.287
0.480                 0.0197        0.0367         0.270
0.405                 0.0122        0.0318         0.416
0.405                 0.0122        0.0132         0.034
0.334                 0.00734       0.0216         0.469
0.258                 0.00391       0.0100         0.408
0.233                 0.00307       0.00394        0.108
0.142                 0.00104       0.00118        0.055
0.131                 0.000874      0.00185        0.326
0.111                 0.000619      0.00287        0.666
0.090                 0.000404      0.00132        0.514
0.038                 0.0000784     0.0000973      0.094

Mean                                               0.146








A test of the relative performance of the two tests was made for a specific example of this
type with k = 5, $\sigma = 1$, and a true treatment difference of 0.9. With these values
$\bar{z}$ will attain significance at the 2.2% level (single tail) in 50% of all sets of five
experiments. The actual levels attained by $\bar{z}$ and by the combination of probabilities
test were calculated for twenty such sets, using random normal deviates nos. 6551-6650, each
increased by 0.9, given by Wold (1948). The results are shown in Table 3. $\bar{z} - 0.9$ will
be normally distributed about zero mean with standard deviation $1/\sqrt{5}$. The probability
$P(\bar{z} - 0.9)$ of getting a value of $\bar{z} - 0.9$ greater than the observed value was
calculated for each set, and the results have been arranged in order of magnitude of this
probability. With a large number of sets there would be a uniform distribution of
$P(\bar{z} - 0.9)$ over the range 0 to 1.
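
The two levels compared in this experiment can be computed as in the sketch below. It assumes unit variance and one-tail normal probabilities; the function names and the illustrative set of five deviates are hypothetical and are not Wold's deviates.

```python
import math

def upper_tail(z):
    """One-tail probability P(Z > z) for a standard normal deviate."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def p_from_mean(zs, sigma=1.0):
    """Level given by the mean of k estimates with known variance sigma^2."""
    k = len(zs)
    mean = sum(zs) / k
    return upper_tail(mean * math.sqrt(k) / sigma)

def p_from_combination(zs, sigma=1.0):
    """Level given by combining the k separate one-tail probabilities."""
    ps = [upper_tail(z / sigma) for z in zs]
    half_x = -sum(math.log(p) for p in ps)  # half of -2 * sum(log P)
    k = len(ps)
    return math.exp(-half_x) * sum(half_x ** j / math.factorial(j) for j in range(k))

# Level attained by the mean in 50% of sets when k = 5 and the true difference is 0.9.
median_level = upper_tail(0.9 * math.sqrt(5))

# Hypothetical set in which every experiment gives exactly the true difference.
example = [0.9] * 5
p_l = p_from_mean(example)
p_c = p_from_combination(example)
```

With these assumptions `median_level` is about 0.022, in agreement with the 2.2% quoted above, and for the hypothetical set the mean gives the smaller (more significant) level.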

The results show clearly that $P_l$ tends to be less than $P_c$, and that the difference is
greater in the experiments with small $P(\bar{z} - 0.9)$ which, of course, also have small
$P_l$. The average value of $\log_{10}(P_c/P_l)$ is 0.146. In other words, the significance
levels are on the average in the ratio of 1.40:1. There is thus a clear loss of efficiency
with $P_c$. Equally important, there are considerable discrepancies in individual sets of
experiments, leading to a conflict of evidence from the same body of data which arises solely
from the use of an inefficient test.


SUMMARY 


It is shown that $\chi^2$ without correction for continuity will give one-tail probabilities
for 2 × 2 tables which may be safely combined in most cases likely to be met with in practice.
The summation of the corresponding signed values of $\chi$ gives a rapid method of combination.
Reasons are given for believing that combination of probabilities tests are not likely to be
very efficient, and this conclusion is demonstrated by a small sampling experiment.


REFERENCES 


Cochran, W. G. (1942). Iowa State Coll. J. Sci. 16, 421.

Cochran, W. G. (1952). Ann. Math. Statist. 23, 315.

Cochran, W. G. (1954). Biometrics, 10, 417.

Lancaster, H. O. (1949). Biometrika, 36, 370.

Pearson, E. S. (1950). Biometrika, 37, 383.

Wold, H. (1948). Random Normal Deviates. Tracts for Computers, no. xxv.
Cambridge University Press.

Yates, F. (1934). J. R. Statist. Soc. Suppl. 1, 217.

Yates, F. (1955). Biometrika, 42, 382.








[ 412 ] 





A TEST FOR HOMOGENEITY OF THE MARGINAL DISTRIBUTIONS
IN A TWO-WAY CLASSIFICATION


By ALAN STUART 
Division of Research Techniques, London School of Economics 


1. INTRODUCTION 


There are several circumstances in which we may wish to test the homogeneity of the two
sets of marginal probabilities in a two-way classification. For example, a sample from a
bivariate distribution (say height of father, height of son) may be classified into a two-way
table with identical (height) groupings in each margin. Or a similar classification may be
possible for a non-measurable variable (say strength of right hand, strength of left hand).
Again, in surveys of the same sample (a 'panel') on two different occasions, the
interrelation of the results on the two occasions may be displayed in a two-way table, with one
margin corresponding to each occasion. In all these cases, the question may arise: are the
two sets of marginal probabilities identical?

If the variable is measurable, we may test the difference between the means of the two
marginal distributions by a large-sample standard-error test. However, we may be interested
in the overall distributions, rather than only in their means. For the more stringent
hypothesis of homogeneity, a test exists if we have two completely independent samples, when
an ordinary $\chi^2$ test of homogeneity may be applied (Cramér, 1946, p. 445). This test does
not meet the essentially bivariate situations described above, where non-independence of the
marginal distributions is a fundamental feature of the problem.

When the classification is a double dichotomy, the problem of testing marginal homo- 
geneity is simple, and its solution is a special case of the large-sample solution of the more 
general 2 classification problem given by Cochran (1950). Bowker (1948) gave a large- 
sample test for complete symmetry in a two-way classification, a more restrictive hypo- 
thesis which is concerned with the entire set of probabilities in the classification, and not 
only with the marginal probabilities as we are here. In the present paper, a large-sample 
test for marginal homogeneity is derived and illustrated. 


2. VARIANCES AND COVARIANCES 


With the usual notation for an m × m table, we denote the probability of falling into the ith
row and jth column by $p_{ij}$, and define the marginal probabilities by

$$p_{i.} = \sum_j p_{ij}, \qquad p_{.j} = \sum_i p_{ij},$$

while obviously

$$\sum_i \sum_j p_{ij} = \sum_i p_{i.} = \sum_j p_{.j} = 1.$$

The corresponding sample numbers are denoted by $n_{ij}$, $n_{i.}$, $n_{.j}$, while the total
sample size is n.



















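By standard multinomial theory, we have for the means, variances and co-variances
of the $n_{ij}$

$$E(n_{ij}) = n p_{ij},$$
$$V(n_{ij}) = n p_{ij}(1 - p_{ij}), \tag{1}$$
$$C(n_{ij}, n_{kl}) = -n p_{ij} p_{kl} \quad (i \ne k \text{ and/or } j \ne l),$$

and also

$$E(n_{i.}) = n p_{i.},$$
$$V(n_{i.}) = n p_{i.}(1 - p_{i.}),$$
$$C(n_{i.}, n_{j.}) = -n p_{i.} p_{j.} \quad (i \ne j), \tag{2}$$
$$C(n_{.i}, n_{.j}) = -n p_{.i} p_{.j} \quad (i \ne j).$$

We now require

$$C(n_{i.}, n_{.j}) = V(n_{ij}) + \sum C(n_{ik}, n_{lj}),$$

where $i \ne l$ and/or $k \ne j$. From (1), this is

$$C(n_{i.}, n_{.j}) = n p_{ij}(1 - p_{ij}) - n \sum p_{ik} p_{lj} = n(p_{ij} - p_{i.} p_{.j}), \tag{3}$$

on taking the term in $p_{ij}^2$ into the summation. We now define the statistics

$$d_i = n_{i.} - n_{.i} \quad (i = 1, 2, \ldots, m).$$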


Since the likelihood-ratio principle yields an intractable result in this case, it seems natural 
to use the d; as the basis for a test concerning the differences between the corresponding 
marginal probabilities. For their means, variances and co-variances, we have, from (2) 
and (3), the exact results: 


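$$E(d_i) = E(n_{i.} - n_{.i}) = n(p_{i.} - p_{.i}), \tag{4}$$

$$V(d_i) = V(n_{i.}) + V(n_{.i}) - 2C(n_{i.}, n_{.i}) = n\{(p_{i.} + p_{.i} - 2p_{ii}) - (p_{i.} - p_{.i})^2\}. \tag{5}$$

Also, for $i \ne j$,

$$C(d_i, d_j) = C(n_{i.}, n_{j.}) + C(n_{.i}, n_{.j}) - \{C(n_{i.}, n_{.j}) + C(n_{.i}, n_{j.})\}$$
$$= -n\{(p_{ij} + p_{ji}) + (p_{i.} - p_{.i})(p_{j.} - p_{.j})\}. \tag{6}$$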


3. TESTING THE HYPOTHESIS OF HOMOGENEITY 


Suppose that we wish to test the hypothesis

$$H_1: (p_{i.} - p_{.i}) = \Delta_i \quad (i = 1 \text{ to } m),$$

where, of course, we must have $\sum_{i=1}^{m} \Delta_i = 0$.

(4), (5) and (6) then become

$$E(d_i \mid H_1) = n\Delta_i,$$
$$V(d_i \mid H_1) = n\{(p_{i.} + p_{.i} - 2p_{ii}) - \Delta_i^2\}, \tag{7}$$
$$C(d_i, d_j \mid H_1) = -n\{(p_{ij} + p_{ji}) + \Delta_i \Delta_j\} \quad (i \ne j),$$

and in the special case of particular interest here, where we have

$$H_0: \Delta_i = 0 \quad (\text{all } i),$$

(7) becomes

$$E(d_i \mid H_0) = 0,$$
$$V_{ii} = V(d_i \mid H_0) = n(p_{i.} + p_{.i} - 2p_{ii}), \tag{8}$$
$$V_{ij} = C(d_i, d_j \mid H_0) = -n(p_{ij} + p_{ji}) \quad (i \ne j).$$


We have deliberately refrained from replacing $p_{i.}$ by $p_{.i}$ in $V_{ii}$, as we could
have done, since if, as will generally be the case, the true probabilities are unknown, the
maximum-likelihood estimators of the variances and co-variances in (8) are

$$\hat{V}_{ii} = n_{i.} + n_{.i} - 2n_{ii},$$
$$\hat{V}_{ij} = -(n_{ij} + n_{ji}). \tag{9}$$

It is well known that the $n_{ij}$ have a limiting multinormal distribution with moments given
by (1). It follows that the $n_{i.}$ and $n_{.j}$ will also be jointly normally distributed,
since they are linear functions of the $n_{ij}$, and finally, by the same argument, that the
variates $d_i$ (i = 1 to m) will be asymptotically multinormally distributed with moments
given by (7) in general, and by (8) on the null hypothesis. To a further degree of
approximation, the same result holds if (9) is used as an estimator of (8).

Now it is a standard result that in the exponent, say $-\tfrac{1}{2}Q$, of a multinormal
distribution, Q is distributed like $\chi^2$ with degrees of freedom equal in number to the
rank of the distribution. Furthermore, Q in this exponent is simply a quadratic form in the
variables. Thus, on $H_0$,

$$Q = \sum_{i,j=1}^{m} a_{ij} d_i d_j \tag{10}$$

is distributed in the limit like $\chi^2$ with (m − 1) degrees of freedom, the rank of the
distribution being (m − 1) since $\sum_{i=1}^{m} d_i = 0$, and in general this is the only
constraint on the $d_i$. If the distribution were non-singular, the coefficients $a_{ij}$
would simply be the elements of the inverse of the variance-covariance matrix $(V_{ij})$. As
a result of the singularity, $(V_{ij})$ itself is also singular, and so we cannot invert it.

However, any marginal distribution of a multinormal distribution is itself normal, so
that we may eliminate the redundant variate (say the last) and obtain the result that

$$Q = \sum_{i,j=1}^{m-1} V^{ij} d_i d_j \tag{11}$$

is distributed in the limit like $\chi^2$ with (m − 1) degrees of freedom, (m − 1) still being
the rank of the distribution. If we replace the $V^{ij}$ in (11) by their maximum-likelihood
estimators given by (9), the same asymptotic result holds. In (11), $(V_{ij})$ is now
understood to be defined for i, j = 1 to m − 1 only, and $(V^{ij})$ is its inverse.

The fact that (11) takes no explicit account of $d_m$ lends it an appearance of arbitrariness,
but although the values of the terms in Q are changed by eliminating some other $d_i$ instead
of $d_m$, their sum Q is uniquely determined. This is in virtue of the fact that since any
(m − 1) of the $d_i$ uniquely determine the other one, Q must be, apart from constants, the
log likelihood of the complete set of $d_i$, irrespective of which of the $d_i$ is omitted in
Q. In point of fact, Q could be expressed as a function of all the m values of $d_i$ and the
complete matrix $(V_{ij})$, but this merely complicates the computational procedure and makes
no difference to the result.

Finally, we must determine the appropriate critical region of the distribution of Q.
(7) shows that, when $H_0$ is not true, the variance-covariance matrix of the $d_i$ may be
written

$$(V_{ij}) = n(w_{ij}),$$

where $(w_{ij})$ is independent of n. The inverse matrix is therefore

$$(V^{ij}) = \frac{1}{n}(w^{ij}).$$

Thus the expected value of Q in (11) is

$$E(Q \mid H_1) = \sum_{i,j=1}^{m-1} \frac{1}{n} w^{ij} E(d_i d_j)
= \sum_{i,j=1}^{m-1} \frac{1}{n} w^{ij}\{C(d_i, d_j) + E(d_i)E(d_j)\},$$

so that, using (7),

$$E(Q \mid H_1) = O(n). \tag{12}$$

On the other hand, the limiting $\chi^2$ distribution of Q on $H_0$ gives us

$$E(Q \mid H_0) = m - 1, \tag{13}$$

independent of n. Thus, with increasing n, the difference between (12) and (13) will exceed
any bound, and the appropriate critical region for our large-sample test is the upper tail
of the distribution of Q. The same conclusion holds if $(V_{ij})$ is replaced by its
maximum-likelihood estimator $(\hat{V}_{ij})$.


4. COMPUTATIONAL PROCEDURE: AN EXAMPLE

We now set out the computation necessary for the test, and give an illustration of its use.
(1) Form the matrix $(\hat{V}_{ij})$ given by (9), for i, j = 1 to (m − 1).
(2) Invert $(\hat{V}_{ij})$ to obtain $(\hat{V}^{ij})$.
(3) Compute $Q = \sum_{i,j=1}^{m-1} \hat{V}^{ij} d_i d_j$, where $d_i = n_{i.} - n_{.i}$, and
test in the $\chi^2$ distribution with (m − 1) degrees of freedom, the critical region being
the upper tail.
Consider the following example, which uses data quoted by Stuart (1953).


7477 women aged 30-39; unaided distance vision

                                  Left eye
Right eye        Highest   Second   Third   Lowest   Total
                 grade     grade    grade   grade
Highest grade      1520      266     124      66     1976
Second grade        234     1512     432      78     2256
Third grade         117      362    1772     205     2456
Lowest grade         36       82     179     492      789
Total              1907     2222    2507     841     7477





























We have

$$d_1 = 1976 - 1907 = 69,$$
$$d_2 = 2256 - 2222 = 34,$$
$$d_3 = 2456 - 2507 = -51,$$

and we check that $d_4 = 789 - 841 = -52$ equals the sum $(d_1 + d_2 + d_3)$, negatively
signed. Since $d_1$ and $d_2$ are positive, while $d_3$ and $d_4$ are negative, one may ask
whether the sight of the right eye may really be regarded as better than that of the left eye
for the population from which this is a sample. The estimated variance matrix of $d_1$,
$d_2$, $d_3$ is obtained by use of (9), as

$$\hat{V}_{11} = (266 + 124 + 66) + (234 + 117 + 36) = 843,$$
$$\hat{V}_{12} = \hat{V}_{21} = -(234 + 266) = -500,$$

and so on, giving

$$(\hat{V}_{ij}) = \begin{pmatrix} 843 & -500 & -241 \\ -500 & 1454 & -794 \\ -241 & -794 & 1419 \end{pmatrix}.$$

The inverse is obtained directly as

$$(\hat{V}^{ij}) = 10^{-6} \begin{pmatrix} 2482 & 1560 & 1295 \\ 1560 & 1972 & 1368 \\ 1295 & 1368 & 1690 \end{pmatrix},$$

and we have, for our $\chi^2$ statistic with 3 degrees of freedom,

$$Q = 10^{-6}\{(2482 \times 69^2) + (1972 \times 34^2) + (1690 \times 51^2) + 2(1560 \times 69 \times 34)
- 2(1295 \times 69 \times 51) - 2(1368 \times 34 \times 51)\} = 11.96.$$

The 1% point for the $\chi^2$ distribution with 3 degrees of freedom is 11.345, so we conclude
that the result is significant of a difference in the distribution of sight in right and left
eye.
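
The three computational steps can be sketched in a few lines of code. This is an illustration under stated assumptions rather than the author's own procedure: the function name `marginal_homogeneity_q` and its `drop` argument are hypothetical, and instead of writing out the inverse matrix the sketch solves the linear system $\hat{V}x = d$ by Gaussian elimination, which gives the same quadratic form.

```python
def marginal_homogeneity_q(table, drop=None):
    """Q of (11) for a square table, eliminating the category index `drop`."""
    m = len(table)
    if drop is None:
        drop = m - 1  # eliminate the last category by default
    row = [sum(r) for r in table]
    col = [sum(table[i][j] for i in range(m)) for j in range(m)]
    keep = [i for i in range(m) if i != drop]
    d = [float(row[i] - col[i]) for i in keep]
    # Estimated variances and co-variances from (9).
    v = [[float(row[i] + col[i] - 2 * table[i][i]) if i == j
          else -float(table[i][j] + table[j][i])
          for j in keep] for i in keep]
    # Solve v x = d by Gaussian elimination with partial pivoting.
    n = len(d)
    a = [v[i] + [d[i]] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n + 1):
                a[r][k] -= f * a[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (a[i][n] - sum(a[i][k] * x[k] for k in range(i + 1, n))) / a[i][i]
    # Q = d' V^{-1} d = d' x
    return sum(di * xi for di, xi in zip(d, x))

vision = [
    [1520, 266, 124, 66],
    [234, 1512, 432, 78],
    [117, 362, 1772, 205],
    [36, 82, 179, 492],
]
q_last = marginal_homogeneity_q(vision)           # eliminating d_4
q_first = marginal_homogeneity_q(vision, drop=0)  # eliminating d_1
```

Both calls give Q of about 11.96, so the choice of eliminated category does not affect the result for these data.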
If, instead of eliminating $d_4$, we had eliminated, say, $d_1$, we should have obtained as
above:

$$(\hat{V}_{ij}) = \begin{pmatrix} 1454 & -794 & -160 \\ -794 & 1419 & -384 \\ -160 & -384 & 646 \end{pmatrix},$$

$$(\hat{V}^{ij}) = 10^{-6} \begin{pmatrix} 1332 & 995 & 921 \\ 995 & 1583 & 1187 \\ 921 & 1187 & 2482 \end{pmatrix},$$

and finally

$$Q = 10^{-6}\{(1332 \times 34^2) + (1583 \times 51^2) + (2482 \times 52^2) + 2(1187 \times 51 \times 52)
- 2(995 \times 34 \times 51) - 2(921 \times 34 \times 52)\} = 11.96$$

as before.


I am grateful to Mr J. Durbin for several illuminating discussions of this problem.


REFERENCES 


Bowker, Albert H. (1948). A test for symmetry in contingency tables. J. Amer. Statist. Ass. 43,
572-4.

Cochran, W. G. (1950). The comparison of percentages in matched samples. Biometrika, 37, 256-66.

Cramér, Harald (1946). Mathematical Methods of Statistics. Princeton University Press.

Stuart, A. (1953). The estimation and comparison of strengths of association in contingency tables.
Biometrika, 40, 105-10.











[ 417 ] 


DISTRIBUTIONS OF KENDALL’S TAU BASED ON PARTIALLY 
ORDERED SYSTEMS 


By SOL HABERMAN 
University of Minnesota 


INTRODUCTION 


If a pair of objects a, b are connected by a relation R which is non-reflexive, anti-symmetric 
and transitive, we say, if aRb, that a precedes b. Suppose we have a set of n objects in which 
R holds between some, not necessarily all, objects. The situation may be represented dia- 
grammatically by a set of points, one to each object, joined by lines between each pair of 
objects which are connected by R. The relation aRb between any pair exists, if any, only if 
there is a path along the network joining the corresponding points. We may represent the 
direction of the relationship by an arrow on the corresponding line, and the relation aRb 
requires that the arrows along one path, at least between a and b, all go in the same direction.
For convenience, in this paper, we shall regard all our directions as going downwards (from
the head of the page towards the foot), and it will not be necessary actually to draw the
arrows.

On this understanding, any downward path in the diagram corresponds to an ordering 
of the objects through which it passes. Now suppose that each of the n objects exhibits one 
of the ranks of each of a set of ranked variables. For example, if there are two variables, 
A, a dichotomy into male and female (ranked 1, 2), and B, a fourfold division into social 
class, working, lower-middle, upper-middle and upper (ranked 1, 2, 3, 4), any individual 
under consideration will bear one of the ranks for each of A and B; the ranking (23), for 
instance, denoting a female in the upper middle class. In this particular case there are 
2x 4 = 8 types of individual and the number n of objects is accordingly 8. 

If comparisons are made between neighbouring members differing only in one variable, 
beginning with (1, 1) and ending with (2, 4) we have what is called a partial ordering. Thus 
(11), (12), (13), (23), (24) is such an ordering, and so is (11), (21), (22), (23), (24). The situation 
is illustrated in Fig. 1, both partial orderings corresponding to downward paths on the 
diagram. Generally, if the objects correspond to rankings of $p_1, p_2, \ldots, p_r$, a partial
ordering will be a set of $p_1 + p_2 + \ldots + p_r - r + 1$ objects such that any consecutive
pair is of the type $(a, b, \ldots, g, \ldots, z)$, $(a, b, \ldots, g + 1, \ldots, z)$. The
number n is equal to $p_1 p_2 \ldots p_r$.

There will be n! rankings of the n objects and

$$\frac{(\sum p_j - r)!}{(p_1 - 1)!\,(p_2 - 1)! \ldots (p_r - 1)!}$$

partial orderings beginning with (111 ... 1) and ending with $(p_1 p_2 p_3 \ldots p_r)$. For
in the process of following through any partial ordering any rank, say the jth, has to make
$p_j - 1$ unit moves and these can occur at any point in the $\sum (p_j - 1)$ moves. Thus with
$p_1 = 2$, $p_2 = 4$ there are four partial orderings:

(11) (21) (22) (23) (24)
(11) (12) (22) (23) (24)
(11) (12) (13) (23) (24)                                   (1)
(11) (12) (13) (14) (24)
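
The counting argument can be illustrated by a short sketch that evaluates the formula and, separately, lists the orderings by arranging the unit moves. The function names are hypothetical and serve only this illustration.

```python
from itertools import permutations
from math import factorial

def n_partial_orderings(ranks):
    """Number of orderings from (1,...,1) to (p1,...,pr) by the formula above."""
    total = factorial(sum(p - 1 for p in ranks))
    for p in ranks:
        total //= factorial(p - 1)
    return total

def enumerate_orderings(ranks):
    """List every ordering as a path of unit moves, one variable at a time."""
    # Variable j must be raised by one unit exactly p_j - 1 times.
    moves = [j for j, p in enumerate(ranks) for _ in range(p - 1)]
    paths = set()
    for order in set(permutations(moves)):
        point = [1] * len(ranks)
        path = [tuple(point)]
        for j in order:
            point[j] += 1
            path.append(tuple(point))
        paths.add(tuple(path))
    return sorted(paths)

count_2x4 = n_partial_orderings((2, 4))       # 4
count_2x2x2 = n_partial_orderings((2, 2, 2))  # 6
paths_2x4 = enumerate_orderings((2, 4))
```

For the 2 × 4 case the enumeration returns the four orderings displayed in (1), each comprising five objects.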










With $p_1 = 2$, $p_2 = 2$, $p_3 = 2$ there are six partial orderings each comprising four
objects; and so on. The totality of partial orderings may be called the set of such orderings.
Consider a ranking of the n objects, e.g. for 2 × 4 objects,


(11) (12) (13) (14) (21) (22) (23) (24).                   (2)


We see that every pair of objects in the partial orderings (1) are in the same order as in the
ranking (2). Such a ranking is said to be consistent with the set of partial orderings. It is
not unique, e.g.

(11) (12) (13) (21) (14) (22) (23) (24) (3) 


has the same property. Any other ranking may be compared with one of the consistent 
rankings by considering the minimum number of interchanges of adjacent pairs necessary 
to transform one to the other. More than one consistent ranking may be reached by the 
same minimum number of interchanges. 


[Diagram not recoverable from the scan.]
Fig. 1. Partially ordered system for two variables, one dichotomous and the other fourfold.


The problem discussed in this paper may be formulated as follows: A set of partial 
orderings is given, and a ranking of the objects is observed. Can this be regarded as con- 
sistent with the partial orderings? If the observed ranking is not subject to error, of course, 
this question is simply decided by examining whether it is one of the consistent rankings. 
We may also, however, allow for error in the observed ranking and regard it as consistent 
in a probabilistic sense if it is close to a consistent ranking, ‘closeness’ in this sense being 
measured by the minimum number of interchanges of adjacent pairs necessary to transform 
the observed to a consistent ranking. 

This number of interchanges will be denoted by s, and our object is to find the frequency
distribution of s in the population of n! possible rankings of the objects. Thus, the
hypothesis that an observed ranking is consistent with a set of partial orderings will be
accepted if s is small; or, equivalently, if the probability of the observed s or a lower
value is below some assigned significance level. The procedure is similar to Kendall's use of
s to measure the agreement between two rankings (cf. Kendall, 1955). The difference in our
case is that we require the agreement between an observed ranking and one of a number of
possible consistent rankings, not a unique one. Kendall's $\tau$ is defined in terms of s by

$$\tau = 1 - \frac{4s}{n(n - 1)}, \tag{4}$$

but for our purposes it will be sufficient, and more convenient, to work with s itself.
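
As an illustration of these definitions, the minimum number of interchanges of adjacent pairs needed to turn one ranking into another equals the number of pairs of objects standing in opposite order in the two rankings. The sketch below counts those pairs and applies (4); the function names are hypothetical.

```python
def interchanges(observed, reference):
    """Minimum number of adjacent interchanges turning `observed` into `reference`."""
    position = {obj: i for i, obj in enumerate(reference)}
    seq = [position[obj] for obj in observed]
    # Count the pairs that stand in opposite order in the two rankings.
    return sum(1 for i in range(len(seq))
               for j in range(i + 1, len(seq)) if seq[i] > seq[j])

def tau(s, n):
    """Kendall's tau from s by formula (4)."""
    return 1 - 4 * s / (n * (n - 1))

ranking_2 = ["11", "12", "13", "14", "21", "22", "23", "24"]
ranking_3 = ["11", "12", "13", "21", "14", "22", "23", "24"]
s = interchanges(ranking_3, ranking_2)  # 1: only (14) and (21) are interchanged
t = tau(s, len(ranking_2))              # 1 - 4/56
```

Rankings (2) and (3) thus differ by a single interchange, giving a value of $\tau$ of about 0.93 between them.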


DISTRIBUTION OF s FOR THREE DICHOTOMOUS VARIABLES


I consider in the first place three dichotomous variables ($p_1 = p_2 = p_3 = 2$). There are,
as noted above, six partial orderings of four. The corresponding diagram is given in Fig. 2.
The 8 objects give rise to 8! = 40,320 rankings. Of these 48 are consistent.


[Diagram not recoverable from the scan.]
Fig. 2. Partially ordered system for three dichotomous variables.


This last result may be seen as follows: Since (111) comes first and (222) last we need only
consider the other six objects. The six partial orderings are (112, 122); (112, 212);
(121, 122); (121, 221); (211, 221); (211, 212). These do not impose any conditions on the
order of the three (112), (121), (211) or of (122), (212), (221). But any member of the first
precedes two members of the second. The first three must then occupy three of the first four
places. It will be seen that if there is no overlap between members of the two sets the number
of arrangements is 3! × 3! = 36. The only possible overlap occurs when one member of the
first triad follows a member of the second in the fifth place. This gives rise to 12 further
possibilities making 48 in all.
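
The count of 48 can be confirmed by direct enumeration. The sketch below is an illustration, not the method of the paper: it assumes that an object precedes another whenever it is no higher in every variable and lower in at least one, and the function name is hypothetical.

```python
from itertools import permutations, product

def count_consistent_rankings(ranks):
    """Count rankings of all combinations that respect every partial ordering."""
    objects = list(product(*[range(1, p + 1) for p in ranks]))
    # a precedes b when a differs from b and is no higher in any variable.
    pairs = [(a, b) for a in objects for b in objects
             if a != b and all(x <= y for x, y in zip(a, b))]
    count = 0
    for ranking in permutations(objects):
        position = {obj: i for i, obj in enumerate(ranking)}
        if all(position[a] < position[b] for a, b in pairs):
            count += 1
    return count

consistent_2x2x2 = count_consistent_rankings((2, 2, 2))  # 48
```

The same function applied to two dichotomous variables gives 2 consistent rankings.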

Let (111) be represented by A; each of (211), (121), (112) by B; each of (221), (212), (122) 
by C; and (222) by D. 

In a ranking of eight A and D can occur in 56 different ways, which we can classify into 
14 categories as follows: 


[Diagrams of categories (1), (2), ..., showing the positions of A and D among the eight
places; not recoverable from the scan.]

and so on up to the seventh category.

The seven categories which follow are a horizontal mirror image of the first seven.

When any given ranking of the three B’s and C’s is placed in the blank positions it is 
found that s increases by 1 for each successive category. We can then build up the distribu- 
tion of s in the 56 x 720 = 40,320 rankings of 8 when we have the distribution for one case, 
e.g. the array with A first and D last. It is not easy to ascertain this distribution. There are 
720 members and each has to be compared with one of 48 rankings. By writing down the 
720 permutations and counting I find the distribution given in Table 1. The distribution 
for the 40,320 rankings is then obtained as in Table 2. 
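
The enumeration for the array with A first and D last can also be carried out mechanically. The sketch below is an illustration under the definitions given earlier, with hypothetical names: it lists the 720 permutations of the three B's and three C's, finds the consistent ones, and records for each permutation the smallest number of interchanges separating it from any consistent ranking.

```python
from itertools import permutations

B = ["112", "121", "211"]
C = ["122", "212", "221"]
objects = B + C

# A member of B precedes a member of C when it is no higher in every variable.
precedes = [(b, c) for b in B for c in C if all(x <= y for x, y in zip(b, c))]

def is_consistent(ranking):
    position = {obj: i for i, obj in enumerate(ranking)}
    return all(position[b] < position[c] for b, c in precedes)

def interchanges(observed, reference):
    """Number of pairs standing in opposite order in the two rankings."""
    position = {obj: i for i, obj in enumerate(reference)}
    seq = [position[obj] for obj in observed]
    return sum(1 for i in range(len(seq))
               for j in range(i + 1, len(seq)) if seq[i] > seq[j])

all_rankings = list(permutations(objects))
consistent = [r for r in all_rankings if is_consistent(r)]

# Frequency of each value of s over the 720 permutations.
distribution = {}
for ranking in all_rankings:
    s = min(interchanges(ranking, ref) for ref in consistent)
    distribution[s] = distribution.get(s, 0) + 1
```

The frequency at s = 0 is the number of consistent rankings, 48, and the frequencies sum to 720.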


Table 1. Distribution of s for the 6 B's and C's

s        Frequency of s
0          48
1          48
2          84
3         108
4         108
5         108
6          96
7          60
8          36
9          24
Total     720










Some further distributions for the case of two variates are given in Table 3 for the 2 x 2, 
2x3, 2x4 and 3 x 3 cases. 

With four dichotomous variables there are 16! rankings to be considered. (1111) and
(2222) occur in 240 ways in a ranking of 16 and, as before, may be treated independently of
the remaining 14 combinations in any enumeration for the distribution of s. Since direct
enumeration of the distribution of s for the remaining 14 combinations would have been
too tedious, an estimate of the form of the distribution was obtained by sampling.

The 14 combinations were written on slips which were placed in a bowl from which drawings
were made to obtain a sample of some 2500 rankings. This number was held to be substantial
in view of the 240 rankings of 16 associated with each of the 2500. The 240 × 2500
values of s were weighted according to the theoretical frequencies of certain
sub-distributions to which they belonged. When graphed as a frequency polygon, the final
estimate of the form of the distribution appeared almost identical with a normal curve. I am
unable at this stage to give a more comprehensive account of the distributions of s and
partially ordered systems.


Table 2. Computations used to obtain the distribution of s for the rankings of the eight
combinations of three dichotomous variables

                                                      Values of s
Category   22   21   20   19   18   17   16   15   14   13   12   11   10    9    8    7    6    5    4    3    2    1    0
1           -    -    -    -    -    -    -    -    -    -    -    -    -   24   36   60   96  108  108  108   84   48   48
2           -    -    -    -    -    -    -    -    -    -    -    -   48   72  120  192  216  216  216  168   96   96    -
3           -    -    -    -    -    -    -    -    -    -    -   72  108  180  288  324  324  324  252  144  144    -    -
4           -    -    -    -    -    -    -    -    -    -   96  144  240  384  432  432  432  336  192  192    -    -    -
5           -    -    -    -    -    -    -    -    -  120  180  300  480  540  540  540  420  240  240    -    -    -    -
6           -    -    -    -    -    -    -    -  144  216  360  576  648  648  648  504  288  288    -    -    -    -    -
7           -    -    -    -    -    -    -  168  252  420  672  756  756  756  588  336  336    -    -    -    -    -    -
8           -    -    -    -    -    -  168  252  420  672  756  756  756  588  336  336    -    -    -    -    -    -    -
9           -    -    -    -    -  144  216  360  576  648  648  648  504  288  288    -    -    -    -    -    -    -    -
10          -    -    -    -  120  180  300  480  540  540  540  420  240  240    -    -    -    -    -    -    -    -    -
11          -    -    -   96  144  240  384  432  432  432  336  192  192    -    -    -    -    -    -    -    -    -    -
12          -    -   72  108  180  288  324  324  324  252  144  144    -    -    -    -    -    -    -    -    -    -    -
13          -   48   72  120  192  216  216  216  168   96   96    -    -    -    -    -    -    -    -    -    -    -    -
14         24   36   60   96  108  108  108   84   48   48    -    -    -    -    -    -    -    -    -    -    -    -    -

Totals     24   84  204  420  744 1176 1716 2316 2904 3444 3828 4008 3972 3720 3276 2724 2112 1512 1008  612  324  144   48

Grand total, 40,320
MISSING COMBINATIONS

In the preceding, distributions of s are given for the situation where all of the possible
combinations of the ranks of a set of variables are considered. It happens in practice,
however, that some combinations do not appear at all or else appear with negligible
frequencies. In such cases those combinations cannot be ranked with respect to the others.
Since the problem discussed in this paper is to test if an observed ranking of combinations is
consistent with a partially ordered system of the same combinations, we have to derive
distributions based on partially ordered systems in which those combinations do not appear.
This was done in connexion with the derivation of the distribution of s for the case of three
dichotomous variables. Table 1 gives the distribution which is based on the partially
ordered system in which both (111) and (222) are missing.

When combinations are considered a few at a time, the distributions so formed often fall
into classes where within each class all are identical. For instance, when, in the case of
three dichotomous variables, the combinations are considered 7 at a time we get either of
two forms of distribution of s:
(a) Form F, if one of the B's or C's is missing; and
(b) Form G, if either A or D is missing.


The two forms of distribution are given in Table 4.







Table 3. Two variable cases

A, both dichotomous. B, one dichotomous and the other having three ranks.
C, one dichotomous and the other having four ranks. D, both having three ranks.

                     Frequencies
s           A        B         C          D
0           2        5        14         42
1           4       15        56        170
2           6       32       150        464
3           6       53       318      1,016
4           4       77       587      1,946
5           2       96       963      3,362
6           -      105     1,448      5,350
7           -      102     2,009      7,942
8           -       88     2,599     11,096
9           -       67     3,151     14,664
10          -       43     3,601     18,442
11          -       24     3,883     22,138
12          -       10     3,958     25,434
13          -        3     3,817     28,010
14          -        -     3,477     29,604
15          -        -     2,990     30,054
16          -        -     2,418     29,306
17          -        -     1,832     27,454
18          -        -     1,289     24,684
19          -        -       839     21,276
20          -        -       495     17,548
21          -        -       260     13,818
22          -        -       115     10,354
23          -        -        41      7,354
24          -        -         9      4,916
25          -        -         1      3,072
26          -        -         -      1,770
27          -        -         -        930
28          -        -         -        432
29          -        -         -        170
30          -        -         -         52
31          -        -         -         10

Totals     24      720    40,320    362,880






















Table 4. Distribution of s for three dichotomous variables with one combination missing

                    Frequencies
s         Form F               Form G
          (6 distributions)    (2 distributions)
0            16                   48
1            48                   96
2           104                  180
3           184                  288
4           284                  396
5           396                  504
6           500                  600
7           572                  612
8           604                  600
9           588                  540
10          528                  432
11          436                  324
12          324                  216
13          220                  120
14          132                   60
15           68                   24
16           28                    -
17            8                    -
Totals     5040                 5040



















Table 5. Distribution of s for the case of two variables, one dichotomous
and one fourfold, with one combination missing

           Frequencies (two distributions in each form)
   s      Form P     Form Q     Form R     Form S
   0        14          5          9          7
   1        42         20         32         26
   2        94         52         75         64
   3       168        105        142        125
   4       269        182        232        210
   5       376        278        338        312
   6       485        383        445        419
   7       561        480        533        512
   8       604        553        587        575
   9       594        588        597        596
  10       544        578        560        570
  11       450        525        484        503
  12       344        439        382        407
  13       235        337        275        301
  14       145        235        178        201
  15        74        147        101        119
  16        32         80         48         60
  17         8         37         18         25
  18         1         13          4          7
  19         -          3          -          1
Totals    5040       5040       5040       5040























Again, when in the case of two variables where one is dichotomous and the other has 
four ranked classifications, the combinations are considered 7 at a time and we get four forms 
of distribution of s: 

(a) Form P, if either (11) or (24) is missing; 

(b) Form Q, if either (21) or (14) is missing; 

(c) Form R, if either (12) or (23) is missing; and

(d) Form S, if either (13) or (22) is missing. 

The four forms of distribution are given in Table 5. 


The writer is indebted to Prof. F. Stuart Chapin for encouragement and to Prof. 
Palmer O. Johnson for guidance. 


REFERENCE 
KENDALL, M. G. (1955). Rank Correlation Methods, 2nd ed. London. Chas. Griffin and Co. 


















ON A CLASS OF SKEW DISTRIBUTION FUNCTIONS 


By HERBERT A. SIMON†
Carnegie Institute of Technology


I. INTRODUCTION


It is the purpose of this paper to analyse a class of distribution functions that appears in 
a wide range of empirical data—particularly data describing sociological, biological and 
economic phenomena. Its appearance is so frequent, and the phenomena in which it appears 
so diverse, that one is led to the conjecture that if these phenomena have any property in
common it can only be a similarity in the structure of the underlying probability mechanisms.
The empirical distributions to which we shall refer specifically are: (A) distributions of
words in prose samples by their frequency of occurrence, (B) distributions of scientists by
number of papers published, (C) distributions of cities by population, (D) distributions of
incomes by size, and (E) distributions of biological genera by number of species.

No one supposes that there is any connexion between horse-kicks suffered by soldiers 
in the German army and blood cells on a microscope slide other than that the same urn 
scheme provides a satisfactory abstract model of both phenomena. It is in the same direc- 
tion that we shall look for an explanation of the observed close similarities among the five 
classes of distributions listed above. 

The observed distributions have the following characteristics in common: 

(a) They are J-shaped, or at least highly skewed, with very long upper tails. The tails
can generally be approximated closely by a function of the form

$$f(i) = (a/i^k)\, b^i, \tag{1-1}$$

where a, b, and k are constants; and where b is so close to unity that in first approximation
the final factor has a significant effect on f(i) only for very large values of i. Thus, for example,
the number of words that occur exactly i times in James Joyce’s Ulysses is about a/i^k; the
number of authors who published exactly i papers in Econometrica over a twenty-year
period is approximately a/i^k; and so on.

(b) The exponent, k, is greater than 1, and in the cases of word frequencies, publication, 
and urban populations is very close to 2. 

(c) In the cases of word frequencies, publications and biological genera, the function (1-1) 
describes the distribution not merely in the tail but also for small values of i. In these cases 
the ratio f(2)/f(1) is generally in the neighbourhood of one-third, and almost never reaches
one-half; while f(1)/n, where $n = \sum_i f(i)$, is generally in the neighbourhood of one-half.


Property (a) is characteristic of the ‘contagious’ distributions—for example, the negative 
binomial as it approaches its limiting form, Fisher’s logarithmic series distribution. How- 
ever, in the case of the negative binomial, k cannot exceed unity (and equals unity only in 

† I have had the benefit of helpful comments from Messrs Benoit Mandelbrot, Robert Solow and
C. B. Winsten. I am grateful to the Ford Foundation for a grant-in-aid that made the completion of
this work possible.

‡ See Zipf (1949) for numerous examples of distributions with this property.




the limiting case of the log series); and if the distribution has a long tail, so that the convergence
factor, b, is close to unity, f(2)/f(1) cannot be less than one-half. Hence the negative
binomial and Fisher’s log series distributions do not provide a satisfactory fit for data
possessing property (a) together with either (b) or (c).†

It is well known that the negative binomial and the log series distributions can be obtained
as the stationary solutions of certain stochastic processes. For example, J. H. Darwin 
(1953) derives these from birth and death processes, with appropriate assumptions as to 
the birth- and death-rates and the initial conditions. In this paper we shall show that 
stochastic processes closely similar to those yielding the negative binomial or log series 
distributions lead to a class of functions having the three properties enumerated above. 
This class of functions is given by

$$f(i) = A\,B(i, p+1), \tag{1-2}$$

where A and p are constants, and B(i, p+1) is the Beta function of i, p+1:

$$B(i, p+1) = \int_0^1 \lambda^{i-1}(1-\lambda)^{p}\,d\lambda = \frac{\Gamma(i)\,\Gamma(p+1)}{\Gamma(i+p+1)} \qquad (0<i;\ 0<p<\infty). \tag{1-3}$$


Now it is a well-known property of the Gamma function (Titchmarsh, 1939, p. 58) that
as $i \to \infty$, and for any constant, k,

$$\frac{\Gamma(i)}{\Gamma(i+k)} \sim i^{-k}. \tag{1-4}$$

Hence, from (1-3), we have, as $i \to \infty$:

$$f(i) \sim \Gamma(p+1)\, i^{-(p+1)}. \tag{1-5}$$
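The asymptotic relation (1-4) is easy to examine numerically. The following is a minimal sketch, not part of the original argument; the function names `gamma_ratio` and `relative_error` are illustrative assumptions. It evaluates Γ(i)/Γ(i+k) through the log-Gamma function and compares it with i^(-k).

```python
import math


def gamma_ratio(i: float, k: float) -> float:
    """Gamma(i) / Gamma(i + k), computed via log-gamma to avoid overflow."""
    return math.exp(math.lgamma(i) - math.lgamma(i + k))


def relative_error(i: float, k: float) -> float:
    """Relative gap between Gamma(i)/Gamma(i+k) and the approximation i**(-k)."""
    exact = gamma_ratio(i, k)
    approx = i ** (-k)
    return abs(exact - approx) / exact


if __name__ == "__main__":
    # The gap should shrink as i grows, for a fixed constant k.
    for i in (10, 100, 1000, 10000):
        print(i, relative_error(i, 2.0))
```

For k = 2 the exact ratio is 1/(i(i+1)), so the relative gap is 1/i, which shrinks as i grows in the way (1-4) requires.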


Therefore, the distribution (1-2) approximates the distribution (1-1) in the tail (more 
precisely, through the range in which the convergence factor of the latter is close to one). 
Further, if p is positive, k will be greater than 1, as required by (b); and if p is equal to 1,
k will be equal to 2. It is easy to see that in the latter case we will have 





$$f(i) = \frac{A}{i(i+1)}; \qquad \sum_{i=1}^{\infty} f(i) = A, \tag{1-6}$$

so that f(2)/f(1) = 1/3; and f(1)/n = 1/2, as required by (c).
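The two ratios quoted for the case p = 1 can be checked directly from (1-2) and (1-3). The sketch below is an illustration only; the names `beta_fn` and `yule`, and the truncation of the infinite sum at 100,000 terms, are assumptions made for the example.

```python
import math


def beta_fn(x: float, y: float) -> float:
    """Complete Beta function B(x, y) via log-gamma."""
    return math.exp(math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y))


def yule(i: int, p: float, A: float = 1.0) -> float:
    """f(i) = A * B(i, p + 1), as in equation (1-2)."""
    return A * beta_fn(i, p + 1.0)


# Case p = 1: f(i) reduces to A / (i (i + 1)).
ratio_21 = yule(2, 1.0) / yule(1, 1.0)

# n is the sum of f(i) over all i; here truncated (an approximation).
partial_n = sum(yule(i, 1.0) for i in range(1, 100001))
share_1 = yule(1, 1.0) / partial_n

if __name__ == "__main__":
    print(ratio_21, share_1)
```

With the sum truncated at N terms, the partial total is A(1 - 1/(N+1)), so `share_1` differs from one-half only in the sixth decimal place.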

In the remainder of this paper I propose: (a) to describe a stochastic process that leads 
to the stationary distribution (1-2); (b) to discuss some generalizations of this process;
and (c) to construct hypotheses as to why the empirical phenomena mentioned above can 
be represented, approximately, by processes of this general kind. Before proceeding, I 
should like to mention two earlier derivations, one of (1-2), the other of (1-1), that I have been 
able to discover in the literature. 

Some thirty years ago, G. Udny Yule (1924) constructed a probability model, with (1-2) 
as its limiting distribution, to explain the distribution of biological genera by numbers of 
species. He also derived a modified form of (1-2), replacing the complete Beta-function of 
(1-3) by the incomplete Beta-function with upper limit of integration a<1. (This modification
has the same effect as the introduction of the convergence factor, b^i, in (1-1): it
causes a more rapid decrease in f(i) for very large values of i; cf. also Darwin (1953, p. 378).)
It seems highly appropriate to call the distribution (1-2) the Yule distribution. 

† The contrasting characteristics of distributions for which the log series provides a satisfactory
fit and those, under consideration here, for which it does not are illustrated by examples (i) and (ii),
respectively, in Good (1953).










Because Yule’s paper predated the modern theory of stochastic processes, his derivation 
was necessarily more involved than the one we shall employ here. Moreover, while the 
assumptions he required are plausible for the particular biological problem he treated, the 
corresponding assumptions applied to the four other phenomena we have mentioned appear 
much less plausible. Our derivation requires substantially weaker assumptions than Yule’s 
about the underlying probability mechanism. 

More recently D. G. Champernowne (1953) has constructed a stochastic model of income
distribution that leads to (1-1) and to generalizations of that function. Since the points of 
similarity between his model and the one under discussion here are not entirely obvious at 
a first examination, I shall consider their relation in a later section of this paper. 


II. THE STOCHASTIC MODEL 


For ease of exposition, the model will be described in terms of word frequencies. In a later 
section, alternative interpretations will be provided. Our present interest is in the kind of 
stochastic process that would lead to (1-2). 

Consider a book that is being written, and that has reached a length of k words. We
designate by f(i, k) the number of different words that have occurred exactly i times in the
first k words. That is, if there are 407 different words that have occurred exactly once each,
then f(1,k) = 407.

Assumption I. The probability that the (k+1)-st word is a word that has already appeared
exactly i times is proportional to i f(i,k); that is, to the total number of occurrences of all
the words that have appeared exactly i times.

Note that this assumption is much weaker than the assumption (I’): that the probability 
a particular word occur next be proportional to the number of its previous occurrences. 
Assumption (I’) implies (I), but the converse is not true. Hence we leave open the possibility
that, among all words that have appeared i times, the probability of recurrence of
some may be much higher than of others. 


Assumption II. There is a constant probability, a, that the (k+1)-st word be a new word— 
a word that has not occurred in the first k words. 

Assumptions (I) and (II) describe a stochastic process, in which the probability that a 
particular word will be the next one written depends on what words have been written 
previously. If this process correctly describes the selection of words, then the words in a 
book cannot be regarded as a random sample drawn from a population with a prior dis- 
tribution. The reasonableness of the former, as compared with the latter type of explanation 
of the observed distributions, will be discussed in §IV.
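A process of this kind is straightforward to simulate. The sketch below is an illustration under stated assumptions: it uses the stronger assumption (I’), implemented by copying a uniformly chosen earlier position of the text, together with assumption (II); the function name `simulate_text`, the integer word labels and the seed are hypothetical choices made for the example.

```python
import random
from collections import Counter


def simulate_text(k: int, a: float, seed: int = 0) -> Counter:
    """Grow a text of k words and return a Counter mapping i -> f(i, k).

    With probability a the next word is new (assumption II); otherwise it
    repeats the word found at a uniformly chosen earlier position, so a word
    is chosen with probability proportional to its previous occurrences
    (assumption I', which implies assumption I).
    """
    rng = random.Random(seed)
    words = [0]      # the first word is necessarily a new word
    next_id = 1
    while len(words) < k:
        if rng.random() < a:
            words.append(next_id)
            next_id += 1
        else:
            words.append(rng.choice(words))
    occurrences = Counter(words)           # word label -> number of occurrences
    return Counter(occurrences.values())   # i -> number of words seen i times


if __name__ == "__main__":
    f = simulate_text(20000, 0.2, seed=1)
    n = sum(f.values())
    print("distinct words:", n, " share occurring once:", f[1] / n)
```

In a run of this kind the number of distinct words should be close to ak, and the share of words occurring exactly once can be compared with the steady-state values derived in the remainder of this section.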

From (I), it follows that

$$E\{f(i,k+1)\} - f(i,k) = K(k)\{(i-1) f(i-1,k) - i f(i,k)\} \qquad (i = 2, \ldots, k+1), \tag{2-1}$$

for if the (k+1)st word is one that has previously occurred (i−1) times, f(i, k+1) will be
increased over f(i,k), and the probability of this, by assumption (I), is proportional to
(i−1) f(i−1,k); if the (k+1)st word is one that previously occurred i times, f(i, k+1) will
be decreased, and the probability of this, by assumption (I), is proportional to i f(i,k);
while in all other cases, f(i,k+1) = f(i,k).

From (I) and (II) we obtain similarly

$$E\{f(1,k+1)\} - f(1,k) = a - K(k) f(1,k) \qquad (0<a<1). \tag{2-2}$$













Since we will be concerned throughout with ‘steady-state’ distributions (as defined by 
equation (2-8) below), we replace the expected values in (2-1) and (2-2) by the actual fre- 
quencies. (Alternatively, we might replace frequencies on the right-hand side of the 
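equation by probabilities.) That is, we write, instead of (2-1) and (2-2),

$$f(i,k+1) - f(i,k) = K(k)\{(i-1) f(i-1,k) - i f(i,k)\} \qquad (i = 2, \ldots, k+1), \tag{2-3}$$

$$f(1,k+1) - f(1,k) = a - K(k) f(1,k), \tag{2-4}$$

where the f’s now represent expected values.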
Now, we wish to evaluate the factor of proportionality K(k). Since K(k) i f(i, k) is the
probability that the (k+1)st word is one that previously occurred i times, we must have

$$\sum_{i=1}^{k} K(k)\, i f(i,k) = K(k) \sum_{i=1}^{k} i f(i,k) = 1 - a. \tag{2-5}$$

But $\sum_{i=1}^{k} i f(i,k)$ is the total number of words up to the kth, hence

$$\sum_{i=1}^{k} i f(i,k) = k, \tag{2-6}$$

and

$$K(k) = \frac{1-a}{k}. \tag{2-7}$$


Substituting (2-7) in (2-3) and (2-4), we could solve these differential equations explicitly. 
We can avail ourselves, however, of a simpler—though non-rigorous—method for dis- 
covering the solutions, and can then test their correctness by substitution in the original 
equations. Consider the ‘steady-state’ distribution in the following sense. We assume

$$\frac{f(i,k+1)}{f(i,k)} = \frac{k+1}{k} \qquad \text{for all } i \text{ and } k; \tag{2-8}$$

so that all the frequencies grow proportionately with k, and hence maintain the same
relative size. (Since we must have f(i,k) = 0 for i>k, equation (2-8) cannot hold exactly
for all i and k. But as explained above, we are concerned at the moment with heuristic
rather than proof.) 

From (2-8) it follows that

$$\frac{f(i,k)}{f(i-1,k)} = \frac{f(i,k+1)}{f(i-1,k+1)} = \beta(i), \tag{2-9}$$

where β(i) does not involve k. Hence, the relative frequencies, which we will designate by
f*(i), are independent of k. Substituting (2-7), (2-8) and (2-9) in (2-3), we get

$$\left(\frac{k+1}{k} - 1\right)\beta(i)\, f^*(i-1) = \frac{(1-a)}{k}\,\{(i-1) - i\,\beta(i)\}\, f^*(i-1). \tag{2-10}$$

Cancelling the common factor, and solving for β(i), we obtain

$$\beta(i) = \frac{(1-a)(i-1)}{1+(1-a)i} = \frac{f^*(i)}{f^*(i-1)} \qquad (i = 2, \ldots, k). \tag{2-11}$$


For convenience, we introduce

$$p = \frac{1}{1-a} \qquad (1<p<\infty). \tag{2-12}$$

Since $f^*(i) = \beta(i) f^*(i-1) = \beta(i)\,\beta(i-1) \cdots \beta(2)\, f^*(1)$, we obtain from (2-11) and (2-12)

$$f^*(i) = \frac{(i-1)(i-2)\cdots 2\cdot 1}{(i+p)(i+p-1)\cdots(1+p)}\, f^*(1) = \frac{\Gamma(i)\,\Gamma(p+1)}{\Gamma(i+p+1)}\, f^*(1) = B(i,p+1)\, f^*(1) \qquad (i = 2, \ldots, k). \tag{2-13}$$

The second relation follows from the fact that

$$\Gamma(i+p+1) = (i+p)\,\Gamma(i+p) = (i+p)(i+p-1)\cdots(1+p)\,\Gamma(p+1). \tag{2-14}$$

But (2-13) is identical with (1-2) if we take A = f*(1).

That (2-13) is in fact a solution of (2-3) can be verified by direct substitution. Moreover,
it is in the following sense a stable solution. Suppose that (2-11) is not satisfied. Whatever
be the values of the f(i, k) for a given k, we may write without loss of generality

$$\frac{f(i,k)}{f(i-1,k)} = \frac{(1-a)(i-1)}{(1-a)i + 1 + \epsilon(i,k)}, \tag{2-15}$$

where ε(i, k) is some function of i and k. If we now divide both sides of (2-3) by f(i, k) and
substitute (2-15) in the right-hand side of the resulting equation, we obtain after simplification

$$\frac{f(i,k+1)}{f(i,k)} = \frac{k+1+\epsilon(i,k)}{k}. \tag{2-16}$$

Hence the ratio of f(i, k+1) to f(i, k) will be greater than (k+1)/k if ε(i, k) is positive, and
less than (k+1)/k if ε(i, k) is negative. Since new words are introduced at a constant rate,
$\sum_{i=1}^{k} f(i,k)$ must be proportional to k; therefore, by (2-16), we will have

$$\sum_{i=1}^{k+1} f(i,k+1) - \frac{k+1}{k}\sum_{i=1}^{k} f(i,k) = \frac{1}{k}\sum_{i=1}^{k} \epsilon(i,k)\, f(i,k) = 0. \tag{2-17}$$

We may interpret the three equations, (2-15)-(2-17), as follows. In an average sense, the
frequencies will grow proportionately with k. If a particular frequency is ‘too large’
compared with the next lower frequency (ε(i, k) negative in (2-15)), it will grow at a rate
slower than the average; if it is ‘too small’ (ε(i, k) positive), it will grow more rapidly than
the average.
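The stability argument can be illustrated by iterating the expected-value recursion for the frequencies, with K(k) = (1−a)/k, from a starting point far from the steady state. The following is a minimal sketch; the function name `iterate_expected`, the one-word starting text and the cut-off `i_max` are assumptions of the example. Because the equation for each i involves only i and i−1, cutting the array off at `i_max` does not alter the values below the cut-off.

```python
def iterate_expected(k_max: int, a: float, i_max: int = 20) -> list[float]:
    """Iterate the expected frequencies from k = 1 up to k = k_max.

    Returns a list f with f[i] = f(i, k_max) for 1 <= i <= i_max (f[0] unused).
    The text starts with a single word, so f(1, 1) = 1.
    """
    f = [0.0] * (i_max + 1)
    f[1] = 1.0
    for k in range(1, k_max):
        K = (1.0 - a) / k                       # factor of proportionality
        new = f[:]
        new[1] = f[1] + a - K * f[1]            # words seen exactly once
        for i in range(2, i_max + 1):           # words seen exactly i times
            new[i] = f[i] + K * ((i - 1) * f[i - 1] - i * f[i])
        f = new
    return f


if __name__ == "__main__":
    a = 0.2
    f = iterate_expected(5000, a)
    p = 1.0 / (1.0 - a)
    for i in range(2, 6):
        # iterated ratio against the steady-state ratio (i - 1) / (i + p)
        print(i, f[i] / f[i - 1], (i - 1) / (i + p))
```

In such a run the successive ratios f(i,k)/f(i−1,k) settle near (i−1)/(i+p) with p = 1/(1−a), which is the steady-state ratio (1−a)(i−1)/(1+(1−a)i) written in terms of p.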

It remains to be shown that f*(i) = B(i, p+1) f*(1) is a proper distribution function.
In particular, we require that $\sum_{i=1}^{k} i B(i,p+1)$ converge as $k \to \infty$. Now, it is well known that
$\sum_{i=1}^{\infty} i^{-a}$ converges for every a > 1. But by (1-4),

$$i B(i,p+1) \sim i \cdot i^{-(p+1)} = i^{-p}. \tag{2-18}$$

Hence, by the usual ratio comparison test, $\sum_{i=1}^{\infty} i B(i,p+1)$ converges for p > 1, as required.


From the definition of a, the total number, n_k, of different words will be ak; while the total
number of word occurrences is k. That is

$$n_k = \sum_{i=1}^{k} f(i,k) = ak = a\sum_{i=1}^{k} i f(i,k). \tag{2-19}$$








Returning to (2-4), and using (2-8), we get

$$\left(\frac{k+1}{k} - 1\right) f^*(1) = a - \frac{(1-a)}{k}\, f^*(1), \tag{2-20}$$

whence

$$f^*(1) = \frac{ka}{2-a} = \frac{n_k}{2-a}. \tag{2-21}$$
From (2-12) and (2-21), and by successive application of (2-11), we can compute the
values of p, f*(1)/n_k, f*(2)/n_k, f*(3)/n_k, etc., for given values of a (Table 1).

Table 1

   a        p       f*(1)/n_k    f*(2)/n_k    f*(3)/n_k
  0.0      1          0.500        0.167        0.083
  0.1      1.11       0.527        0.169        0.082
  0.2      1.25       0.556        0.171        0.080
  0.3      1.43       0.588        0.171        0.077
  0.5      2.00       0.667        0.167        0.067
  0.7      3.33       0.769        0.144        0.046
  0.9     10.00       0.909        0.076        0.012
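The entries of Table 1 follow mechanically from three relations: p = 1/(1−a), f*(1)/n_k = 1/(2−a), and the ratio f*(i)/f*(i−1) = (1−a)(i−1)/(1+(1−a)i). The sketch below recomputes a row from them; the function name `table_row` is a hypothetical choice for the example.

```python
def table_row(a: float, terms: int = 3) -> tuple[float, list[float]]:
    """Return p and the shares f*(1)/n_k, ..., f*(terms)/n_k for a given a."""
    p = 1.0 / (1.0 - a)                     # equation (2-12)
    shares = [1.0 / (2.0 - a)]              # equation (2-21), divided by n_k
    for i in range(2, terms + 1):
        beta_i = (1.0 - a) * (i - 1) / (1.0 + (1.0 - a) * i)   # equation (2-11)
        shares.append(shares[-1] * beta_i)
    return p, shares


if __name__ == "__main__":
    for a in (0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9):
        p, shares = table_row(a)
        print(f"{a:.1f}  {p:6.2f}  " + "  ".join(f"{s:.3f}" for s in shares))
```

The recomputed values agree with the printed table to within about one unit in the third decimal place.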























Thus far we have considered the case where a, the rate at which new words are intro- 
duced, is independent of k. We can easily generalize to the case where a is a function of k 
by making the appropriate substitution in (2-4). The equations can then be solved directly, 
but the method employed to obtain a ‘steady-state’ distribution is not applicable, since it is 
not easy to define what is meant by the steady state in this more general case. We will 
content ourselves with some approximate results for two special cases. These special cases 
will give us insight as to how a distribution function may arise which, for small values of i, 
can be approximated by (1-2), with 0<p< 1. 

Case I. Suppose the system to be in the steady state described by (2-13) with k = k_0,
and that the flow of new words suddenly ceases, so that a(k) = 0 for k > k_0. We will now have
K(k) = 1/k for k > k_0, and (2-4) becomes

$$f(1,k+1) = \left(1 - \frac{1}{k}\right) f(1,k) = \frac{k-1}{k}\, f(1,k). \tag{2-22}$$

We define

$$\gamma(i) = \frac{f(i,k+1)}{f(i,k)} \qquad (i = 2, \ldots, k+1). \tag{2-23}$$

Since no new words are being introduced, we must have

$$n_k = f(1,k) + \sum_{i=2}^{k} f(i,k) = f(1,k+1) + \sum_{i=2}^{k+1} f(i,k+1) = \frac{k-1}{k}\, f(1,k) + \sum_{i=2}^{k} \gamma(i) f(i,k), \tag{2-24}$$

whence

$$\frac{\sum_{i=2}^{k} [\gamma(i) - 1]\, f(i,k)}{\sum_{i=2}^{k} f(i,k)} = \frac{1}{k}\,\frac{f(1,k)}{\sum_{i=2}^{k} f(i,k)}. \tag{2-25}$$






















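Let us now define p_i by

$$\frac{f(i,k)}{f(i-1,k)} = \frac{i-1}{i+p_i} \tag{2-26}$$

(where we suppose that p_i changes only slowly with k). Instead of (2-3), we have

$$f(i,k+1) - f(i,k) = \frac{1}{k}\{(i-1) f(i-1,k) - i f(i,k)\}. \tag{2-27}$$

Substituting (2-23) and (2-26) in this, we get

$$\gamma(i) - 1 = \frac{1}{k}\{(i + p_i) - i\}, \tag{2-28}$$

whence

$$p_i = k(\gamma(i) - 1), \tag{2-29}$$

$$\bar{p} = \frac{\sum_{i=2}^{k} p_i\, f(i,k)}{\sum_{i=2}^{k} f(i,k)} = \frac{k\sum_{i=2}^{k} (\gamma(i) - 1)\, f(i,k)}{\sum_{i=2}^{k} f(i,k)} = \frac{f(1,k)}{\sum_{i=2}^{k} f(i,k)}. \tag{2-30}$$

Define

$$P_k = f(1,k)/n_k. \tag{2-31}$$

Then

$$\bar{p} = \frac{P_k}{1-P_k}, \qquad \text{and} \qquad 0<\bar{p}<\infty. \tag{2-32}$$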


Proceeding heuristically, we can see that after a becomes zero, f(1, k) will begin to decrease
with k, and the value of p_i will be larger the larger is i. For small values of i, we will have
p_i < p̄, and for large values, p_i > p̄. However, the tail of the distribution will be affected
only slowly by the change in a. Hence, we may suppose that $\lim_{i \to k} p_i = p_0$, where p_0 is p(k_0).

On the other hand, since the weighted average in (2-30) is heavily influenced by the large
frequencies for small values of i, p̄ will be only slightly less than p_0. Hence we may expect
the distribution to take the form of a slightly curved line on a double-log scale, with a slope
of −(p̄+1) at the lower end, and a slope of −(p_0+1) at the upper end. If p_0 > 2, then
Σ i f(i, k) will converge. An example of such a distribution will be given in §IV.

Case II. A second approximate solution can be obtained if we assume that a decreases
with k, but very slowly. By definition, we have a = dn_k/dk = n'. The condition for a steady
state (all frequencies increasing proportionately) is now

$$f(i,k+1) = (1 + (n'/n_k))\, f(i,k). \tag{2-33}$$

Substituting as before, (2-7) and (2-33) in (2-3), we again obtain (2-13), where p is now
given by

$$p = \frac{n'k}{n_k}\cdot\frac{1}{(1-a)}. \tag{2-34}$$

The slope obtained in the derivation for constant a has now been multiplied by the factor
(n'k)/n_k, which for monotonically decreasing a is less than one. Hence, the effect of a
decrease in the rate of introduction of new words is to lengthen the tail of the distribution, 
as was also true in case I. If the new value of p is less than one, we do not have a proper 
distribution function (see equation (2-18)), hence the equation can hold only for small and 
moderate values of i, and there must be a curve (on a logarithmic scale) in the tail of the 
distribution. 













III. AN ALTERNATIVE FORMULATION OF THE PROCESS 


There are some alternative ways for deriving the relation (2-13). One of these will be useful 
to us when we come, in the next section, to a more specific discussion of word frequencies 
and frequencies of publications. Moreover, this derivation avoids the difficulties we have 
encountered in the definition of ‘steady state’. 

Equation (2-10) may be written 


$$0 = (1-a)\,[(i-1) f^*(i-1) - i f^*(i)] - f^*(i) \qquad (i = 2, \ldots, k), \tag{3-1}$$


where we have again written f*(i) for f(i, k). 
Similarly, from (2-4), we obtain 
0 = 1—(1—a) f*(1)—f*(1). (3-2) 

These two equations may be interpreted as follows. We consider a sequence of k words. 
We add words to the sequence in accordance with assumptions (I) and (II) of § II, but we 
drop words from the sequence at the same average rate, so that the length of the sequence 
remains k. The method according to which we drop words is the following: 

Assumption III. If one representative of a particular word is dropped, then all repre- 
sentatives of that word are dropped, and the probability that the next word dropped be 
one with exactly i representatives is f*(i).

This assumption would be approximately satisfied, for example, if the representatives 
of each word, instead of being distributed randomly through the sequence, were closely 
‘bunched’. This possibility is consistent with assumption (I). 

Equation (3-1), in our new interpretation, may be regarded as the steady-state equi- 
librium of the stochastic process defined by 

$$f(i,m+1) - f(i,m) = (1-a)\,[(i-1) f(i-1,m) - i f(i,m)] - f(i,m), \tag{3-3}$$

where m is now not the total number of words (which remains a constant, k), but the number
of additions to (and withdrawals from) an initial arbitrary sequence of k words. Since the
k of this process, unlike that of § II, remains constant, the ordinary proofs of the existence
of a unique steady-state solution will apply (see Feller, 1950, p. 373), and we avoid the
troublesome questions of rigour that confronted us in § II.

The solution of (3-1) and (3-2) is, of course, again given by

$$\frac{f^*(i)}{f^*(i-1)} = \frac{(1-a)(i-1)}{1+(1-a)i}. \tag{2-11}$$





If we were to replace the last terms of (3-1) and of (3-2), respectively, by terms corre- 
sponding to the usual form of the death process, we would have (cf. Darwin, 1953, p. 375; 
and Kendall, 1948) 


$$0 = (1-a)\,[(i-1) f^*(i-1) - i f^*(i)] - [i f^*(i) - (i+1) f^*(i+1)] \qquad (i = 2, \ldots, k-1), \tag{3-4}$$

$$0 = 1 - (1-a) f^*(1) - [f^*(1) - 2 f^*(2)]. \tag{3-5}$$

The solution of this system of equations is

$$\frac{f^*(i)}{f^*(i-1)} = \frac{(1-a)(i-1)}{i}, \tag{3-6}$$


which is Fisher’s logarithmic series distribution. 
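The two steady-state systems can be compared by direct substitution. The sketch below is an illustration only, with hypothetical function names. It builds the Yule frequencies from the ratio (1−a)(i−1)/(1+(1−a)i) and the log series frequencies x^i/i with x = 1−a, each up to an arbitrary scale factor, and evaluates the right-hand sides of (3-1) and (3-4). Only the equations for i ≥ 2 are tested, since these do not depend on the normalization.

```python
def yule_frequencies(a: float, i_max: int) -> list[float]:
    """f*(1..i_max) with f*(1) = 1, from the ratio (1-a)(i-1)/(1+(1-a)i)."""
    f = [0.0, 1.0]                                   # index 0 is unused
    for i in range(2, i_max + 1):
        f.append(f[-1] * (1 - a) * (i - 1) / (1 + (1 - a) * i))
    return f


def log_series_frequencies(a: float, i_max: int) -> list[float]:
    """f*(i) = x**i / i with x = 1 - a (unnormalized log series)."""
    x = 1 - a
    return [0.0] + [x ** i / i for i in range(1, i_max + 1)]


def residual_3_1(f: list[float], a: float, i: int) -> float:
    """Right-hand side of (3-1): birth terms minus the 'drop all copies' term."""
    return (1 - a) * ((i - 1) * f[i - 1] - i * f[i]) - f[i]


def residual_3_4(f: list[float], a: float, i: int) -> float:
    """Right-hand side of (3-4): birth terms minus the usual death terms."""
    return (1 - a) * ((i - 1) * f[i - 1] - i * f[i]) - (i * f[i] - (i + 1) * f[i + 1])


if __name__ == "__main__":
    a = 0.3
    yule = yule_frequencies(a, 50)
    logs = log_series_frequencies(a, 50)
    print(max(abs(residual_3_1(yule, a, i)) for i in range(2, 50)))
    print(max(abs(residual_3_4(logs, a, i)) for i in range(2, 50)))
```

Each family makes its own system vanish to rounding error, while the Yule frequencies leave a clearly non-zero residual in (3-4). The sole difference between the two models is thus the form of the death term.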




























Since the log series distribution is a limiting case of the negative binomial, we may ask 
whether there is a distribution that stands in the same relation to the latter as (2-11) stands 
in relation to (3-6). We can obtain such a distribution by a modification of the birth process 
in (3-1). We assume now that the birth-rate is the sum of two components: one proportional
to i f(i), the other proportional to f(i). In place of (3-1) we have

$$0 = \frac{(1-a)k}{k+c}\,[(i-1+c) f^*(i-1) - (i+c) f^*(i)] - f^*(i) \qquad (c \text{ a constant}), \tag{3-7}$$

the solution of which is

$$\frac{f^*(i)}{f^*(i-1)} = \frac{\lambda(i-1+c)}{\lambda(i+c)+1} = \frac{(i-1+c)}{(i+c+(1/\lambda))}, \tag{3-8}$$

where λ = k(1−a)/(k+c).


A rather remarkable property of (3-8) is that in the tail it still has the limiting form (1-1) 
with b = 1. Hence for a and c small, this generalized Yule distribution will still possess the
three properties listed in the introduction. The fact that a reasonably wide range of variation 
in the assumptions underlying the stochastic model does not alter greatly the form of the 
distribution adds plausibility to the use of such stochastic processes to explain the observed 
distributions. Our next task is to consider these explanations in more detail. 
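The tail behaviour of the generalized distribution can be examined numerically. In the sketch below the frequencies are built from the ratio f*(i)/f*(i−1) = (i−1+c)/(i+c+1/λ), and the slope on a double-log scale is measured between two large values of i. The function names, the parameter values λ = 0.8 and c = 0.3, and the range of i are illustrative assumptions. Since the ratio behaves like 1 − (1+1/λ)/i for large i, the slope should approach −(1+1/λ).

```python
import math


def log_frequencies(i_max: int, lam: float, c: float) -> list[float]:
    """log f*(i) for i = 1..i_max (list index i - 1), taking f*(1) = 1."""
    out = [0.0]
    for i in range(2, i_max + 1):
        out.append(out[-1] + math.log((i - 1 + c) / (i + c + 1.0 / lam)))
    return out


def tail_slope(lam: float, c: float, lo: int = 10000, hi: int = 20000) -> float:
    """Slope of log f*(i) against log i between i = lo and i = hi."""
    lf = log_frequencies(hi, lam, c)
    return (lf[hi - 1] - lf[lo - 1]) / (math.log(hi) - math.log(lo))


if __name__ == "__main__":
    lam, c = 0.8, 0.3
    print(tail_slope(lam, c), -(1.0 + 1.0 / lam))
```

Working with logarithms of the ratios avoids underflow for large i. The measured slope does not depend on c to the accuracy shown, which is the sense in which the tail keeps the limiting form (1-1) with b = 1.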


IV. THE EMPIRICAL DISTRIBUTIONS 


In this section I shall try to give theoretical justifications for the observed fit of the Yule 
distribution to a number of different sets of empirical data. 


A. Word frequencies 


A substantial number of word counts have been made, in English and in other languages 
(see Hanley, 1937; Thorndike, 1937; Yule, 1944; Zipf, 1949; and Good, 1953). Equation 
(1-6) provides a good fit to almost all of them. When the more general function, (1-2), is 
used, the estimated value of p is always close to 1. When a convergence factor, b^i, is introduced
to account for the deficiency in frequencies for very large values of i, the estimated
value of b is also very close to 1. Good (1953), for instance, applies (1-6) multiplied by a
convergence factor to the Eldridge count, and obtains b = 0.999667.

These regularities are the more surprising in that the various counts refer to a quite 
heterogeneous set of objects. In the Yule and Thorndike counts, inflected forms are 
counted with the root word; in most of the other counts each form is regarded as a distinct 
word. The Yule counts include only nouns; the others, all parts of speech. The Dewey, 
Eldridge and Thorndike counts are composite—compiled from a large number of separate 
writings; most of the others are based on a single piece of continuous prose. I would regard 
this heterogeneity as further evidence that the explanation is to be sought in a probability 
mechanism, rather than in more specific properties of language; but at the same time, the 
heterogeneity complicates the task of specifying the probability mechanism in detail. 
I shall avoid questions of ‘fine structure ’—which would require an expertness in linguistics 
that I do not possess—and confine myself to three broad problems: (1) the distribution of 
word frequencies in the whole historical sequence of words that constitutes a language; 
(2) the distribution of word frequencies in a continuous piece of prose; (3) the distribution 
of word frequencies in a sample of prose assembled from composite sources.













(1) For obvious reasons, we do not have any empirical data on the cumulated word 
frequencies for a whole language. On a priori grounds, it does not appear unreasonable to 
postulate that these frequencies are determined by a process like that described in § II. The
parameter a is then the rate at which neologisms appear in the language as a fraction of all
word occurrences, and hence a can be assumed to be very close to zero.

(2) The process of §II might also describe the growth of a continuous piece of prose— 
for example, Joyce’s Ulysses. But there are some serious objections to this hypothesis. An 
author writes not only by processes of association—i.e. sampling earlier segments of the 
word sequence—but also by processes of imitation—i.e. sampling segments of word sequences 
from other works he has written, from works of other authors, and, of course, from sequences 
he has heard. The model of §II apparently allows only for association, and excludes imitation.

The word frequencies in Ulysses provide obvious evidence of the importance of both 
processes. The fact that the proper noun ‘Bloom’ occurs 926 times and ranks 30th in 
frequency must be attributed to association. If Joyce had named his hero ‘Smith’, that 
noun, instead of ‘Bloom’, would have ranked 30th. On the other hand, ‘they’, which 
occurs 1010 times in Ulysses and ranks 27th, has very nearly the same rank—the 28th— 
in the Dewey count. In fact, of the 100 most frequent words in Ulysses, 78 are among the 
top 100 in the Dewey count. This similarity in ranking of ‘common’ words argues for 
imitation rather than association. Even for the common words, however, the variations 
in frequency from one count to another are far too great to be explained as fluctuations 
resulting from random sampling from a common population of words. The imitative process 
must involve stratified sampling, and imitation must be compounded with association. 

It is worth emphasizing again at this point that assumption (I) does not require that the 
choice of the next word from among those previously written be completely random. Suppose, 
for example, that a writer were to assign to each page he has already written a number,
p_j, Σ p_j = (1−a), the size of p_j varying with the ‘affinity’ of the subject discussed on the
jth page to the subject next to be discussed. If his next word were selected by a stratified
sampling of the previous pages, with probability p_j for each page, then equation (2-1)
would generally be satisfied. For although individual words would be distributed unevenly
through the preceding pages, the totality of words having a given frequency, i, in all the
previous pages taken together would be distributed almost evenly through these pages.
Hence the various frequency strata would have proportionate probabilities of being
sampled, for most choices of the p_j. This is all that is required for equation (2-1). This same
comment applies to the assumption we shall subsequently make regarding imitative 
sampling from other works. 

Let us now reconsider the problem of a piece of continuous prose. Since both the processes 
of association and imitation are involved, the sequence that is counted is to be regarded as 
a ‘slice’, of length k, of the entire sequence of words in the language, or of the entire sequence 
written by the author. Hence the word count is better described by the stochastic process 
of § III than by the process of § II. 

In determining the probability that a word selected in such a sequence be one that has
occurred exactly i times, we must consider separately the process of imitation and association.
Assume that, on the average, a fraction, β, of the words added is selected by imitation,
and the remaining fraction, (1−β), by association. Since no new words can be introduced
by association, the joint probability that the next word will be selected by association and
will be a word that has already occurred i times is (1−β) i f(i, k)/k.















The words selected by imitation present a more difficult problem, and we shall have to 
content ourselves with a reasonable assumption that has no rigorous justification. On the 
average, a word that has occurred i times will have a chance less than i/k of being the next 
one chosen by imitation, because in the sequence that is being sampled there are words 
that have not yet been chosen at all, and because with progressive change of subject, dif- 
ferent strata of the language will be sampled. Since words with large i will generally be 
‘common’ words, fairly uniformly distributed through all strata of the language, the 
deficiency may be expected to be proportionately greater for small i than for large i. As
a rough, but reasonable, approximation let us assume that: the joint probability that the
next word will be selected by imitation and will be a word that has already occurred i times
is β(i−c) f(i, k)/k, where 0<c<1. (Our result would not be essentially altered if we wrote
c(i) instead of c, provided only that c(i) does not vary a great deal.)

Adding the two joint probabilities, for association and imitation respectively, we
find that the total probability that the next word be one that has occurred i times is
(i−βc) f(i, k)/k. By summing this probability over i and subtracting from 1, we find that
the probability that the next word be a new word is βc(n_k/k).
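The bookkeeping in this step can be checked on any set of frequencies. The sketch below adds the association term (1−β) i f(i,k)/k and the imitation term β(i−c) f(i,k)/k and confirms that the remainder is βc(n_k/k). The function name and the small frequency table are invented for the illustration and are not data from any word count.

```python
def next_word_probabilities(f: dict[int, int], beta: float, c: float):
    """Return (probabilities by i, probability of a new word, n_k, k).

    f maps i to f(i, k), the number of distinct words seen exactly i times.
    """
    k = sum(i * count for i, count in f.items())     # total word occurrences
    n_k = sum(f.values())                            # number of distinct words
    by_i = {}
    for i, count in f.items():
        association = (1.0 - beta) * i * count / k
        imitation = beta * (i - c) * count / k
        by_i[i] = association + imitation            # equals (i - beta*c) f(i,k)/k
    p_new = 1.0 - sum(by_i.values())
    return by_i, p_new, n_k, k


if __name__ == "__main__":
    # Hypothetical frequencies: 50 words seen once, 15 twice, 5 thrice, 1 ten times.
    example = {1: 50, 2: 15, 3: 5, 10: 1}
    by_i, p_new, n_k, k = next_word_probabilities(example, beta=0.4, c=0.5)
    print(p_new, 0.4 * 0.5 * n_k / k)
```

The agreement is exact for any frequency table, because the terms in i f(i,k)/k sum to one and the terms in βc f(i,k)/k sum to βc(n_k/k).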

If the method of dropping words from the sequence satisfies assumption (III) of § ITI, 
we set the difference between the birth-rate and the death-rate equal to zero, and obtain 
the steady-state equation 


0 = (i—cB— 1) f*(i— 1) — (icf) f*(i) —F*(0), (4-1) 


f*@)_ _ (i-08-1) 
f*(@t—-1) (¢—cf+1) 


Again, we obtain a distribution with the required properties. 

(3) The distribution of word frequencies in a sample of prose assembled from composite 
sources can be explained along the same general lines. Again, we may regard the sample as 
a ‘slice’ from a longer sequence, but we might expect the parameters c and £ to be somewhat 
larger than in a comparable piece of continuous prose. The qualification ‘comparable’ is 
important, for c may be expected to be smaller for homogeneous prose using a limited 
vocabulary of common words than for prose with a large vocabulary and treating of a 
variety of subjects. Hence c might well be larger for the continuous Ulysses count than for 
the Eldridge count, which is drawn from newspaper sources. Indeed, the empirical evidence 
suggests that this is the case. 

There is no point in elaborating the explanation further. What has been shown is that the 
observed frequencies can be fitted by distributions derived from probability assumptions 
that are not without plausibility. 

A very different and very ingenious explanation of the observed word-frequency data has 
been advanced recently by Dr Benoit Mandelbrot (1953). His derivation rests on the 
assumption that the frequencies are determined so as to maximize the number of bits of 
information, in the sense of Shannon, transmitted per symbol. There are several reasons 
why I prefer an explanation that employs averaging rather than maximizing assumptions. 
First, an assumption that word usage satisfies some criterion of efficiency appears to be 
much stronger than the probability assumptions required here. Secondly, numerous 
doubts, which I share, have been expressed as to the relevance of Shannon’s information 
measure for the measurement of semantic information. 













436 On a class of skew distribution functions 


Before leaving the subject of word frequencies, it may be of interest to look at some of
the empirical data. Good (1953, pp. 257-60) has obtained good fits to the Eldridge count
and to one of Yule’s counts by the use of equation (1-6). Table 2 summarizes a few of the
data on two word counts, and compares the actual frequencies, $f(1)$, $f(2)$ and $f(3)$, with the
frequencies estimated from equation (1-3). The actual values of $k$ and $n_k$ are used to estimate
$\alpha = n_k/k$, and (2-11) and (2-21) to obtain the expected frequencies. In both cases the
observed value of $n_k/k$ leads to an estimate of $\rho$ in the neighbourhood of 1.1 to 1.2. An empirical
fit to the whole distribution of a function of the form f(i) = Ka~?t» gives an estimated 
value of $\rho$, in both cases, of about one, in reasonable agreement with the first estimate.
A good fit to both the Ulysses and the Eldridge counts can also be obtained from (4-2),
with $c$ equal to about 0.2 in the former case, and close to zero in the latter.

In the case of Thorndike’s count of 4½ million words in children’s books (Thorndike, 1937),
we may assume that the supply of new words was virtually exhausted before the end of the
count. In his count $f(1)$ is substantially below $0.5n_k$ (about $0.34n_k$), as we would expect under
these circumstances (see case I of § II). Thorndike estimated the empirical value of our $\rho$ at
0.45, which is entirely consistent with the observed value of $0.34n_k$ for $f(1)$. For, by (2-32),
$f(1)/n_k = \rho/(\rho+1) = 0.31$.
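
The arithmetic behind the Thorndike comparison can be sketched in a few lines. This is an illustration only: the function names are assumptions, and the only relation used is the one quoted above, that the share of once-occurring words is $\rho/(\rho+1)$.

```python
def f1_share(rho):
    """Expected share f(1)/n_k of words occurring exactly once, rho / (rho + 1)."""
    return rho / (rho + 1.0)


def rho_from_f1_share(share):
    """Inverse of f1_share: the rho implied by an observed share f(1)/n_k."""
    return share / (1.0 - share)


# Thorndike's empirical rho of 0.45 gives the predicted share of once-occurring words.
predicted = f1_share(0.45)
# Conversely, the observed share of 0.34 corresponds to a somewhat larger rho.
implied = rho_from_f1_share(0.34)
print(round(predicted, 2), round(implied, 2))
```

The two directions bracket the comparison made in the text: a predicted share near 0.31 against an observed 0.34.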

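Table 2

| Word count             | $\alpha = n_k/k$ | $f(1)$ actual | $f(1)$ estimate | $f(2)$ actual | $f(2)$ estimate | $f(3)$ actual | $f(3)$ estimate |
|------------------------|------------------|---------------|-----------------|---------------|-----------------|---------------|-----------------|
| Ulysses (Hanley, 1937) | 0.115            | 16,432        | 15,850          | 4,776         | 4,870           | 2,194         | 2,220           |
| Eldridge (Good, 1953)  | 0.136            | 2,976         | 3,220           | 1,079         | 977             | 516           | 400             |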








B. Scientific publications 

At least four sets of data are available on the number, $f(i)$, of authors contributing a
given number, $i$, of papers each to a journal or journals (Davis, 1941; Leavens, 1953).
These are counts of (a) papers written by members of the Chicago Section of the American
Mathematical Society over a 25-year period; (b) papers listed in Chemical Abstracts (under
A and B) over 10 years; (c) papers referred to in a history of physics; and (d) papers and
abstracts in Econometrica over a 20-year period.

We may postulate a mechanism like that of § III, equation (3-1). The authorship of the
next paper to appear is ‘selected’ by stratified sampling from the strata of authors who have
previously published 1, 2, ..., papers, the probability for each stratum being proportional
to $if(i)$. Again, the probabilities for individual authors need not be proportional to $i$, but
only the probabilities for the aggregates of authors with the same $i$. For example (as in the
case of words), the probability for a particular author may be higher if he has published
recently than if he has not. The gradual retirement of authors corresponds to assumption (III).
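
The selection mechanism just described can be simulated in a few lines. The sketch below is an illustration under simplifying assumptions, not the author's procedure: a fixed fraction `alpha` of papers goes to new authors, every other paper goes to the author of a uniformly chosen earlier paper (so that the stratum of authors with $i$ papers is chosen with probability proportional to $if(i)$), and retirement (assumption (III)) is left out. The function name, seed and parameter values are hypothetical.

```python
import random
from collections import Counter


def simulate_authorship(num_papers, alpha, seed=0):
    """Return a Counter mapping i -> f(i), the number of authors with i papers."""
    rng = random.Random(seed)
    papers = [0]              # author id of each paper so far; author 0 wrote the first
    next_author = 1
    for _ in range(num_papers - 1):
        if rng.random() < alpha:
            # The paper is credited to an author who has not published before.
            papers.append(next_author)
            next_author += 1
        else:
            # Copy the author of a random earlier paper: an author with i papers
            # is picked with probability i / (papers so far).
            papers.append(rng.choice(papers))
    papers_per_author = Counter(papers)
    return Counter(papers_per_author.values())


f = simulate_authorship(20_000, alpha=0.3, seed=1)
n_authors = sum(f.values())
print("share of one-paper authors:", f[1] / n_authors)
```

If one takes $\rho = 1/(1-\alpha)$, a relation assumed here because it agrees with the $\alpha$ and $\rho$ rows of Table 3, the share of one-paper authors should settle near $\rho/(\rho+1) = 1/(2-\alpha)$ as the number of papers grows.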

A comparison of the actual frequencies, for i from 1 to 10, with the estimated frequencies, 
derived from (2-11) and (2-21), is shown in Table 3. The fit is reasonably good, when it is 
remembered that only one parameter is available for adjustment. However, it should be 



















HERBERT A. SIMON 437 


noted that the estimated frequencies tend to be too high for $i = 1, 2$ and too low for $i = 3, \ldots, 10$.
In three of the four cases, they are again too high for the tails of the distributions.
A further refinement of the model is apparently needed to remove these discrepancies. 


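Table 3. Number of persons contributing

| No. of          | Chicago Math. Soc.   | Chem. Abstracts      | Physicists           | Econometrica         |
| contributions   | Actual† | Estimate   | Actual† | Estimate   | Actual† | Estimate   | Actual‡ | Estimate   |
|-----------------|---------|------------|---------|------------|---------|------------|---------|------------|
| 1               | 133     | [133]      | 3,991   | 4,050      | 784     | 824        | 436     | 453        |
| 2               | 43      | 46         | 1,059   | 1,160      | 204     | 217        | 107     | 119        |
| 3               | 24      | 23         | 493     | 522        | 127     | 94         | 61      | 51         |
| 4               | 12      | 14         | 287     | 288        | 50      | 50         | 40      | 27         |
| 5               | 11      | 10         | 184     | 179        | 33      | 30         | 14      | 16         |
| 6               | 14      | 7          | 131     | 120        | 28      | 20         | 23      | 11         |
| 7               | 5       | 5          | 113     | 86         | 19      | 14         | 6       | 7          |
| 8               | 3       | 4          | 85      | 64         | 19      | 10         | 11      | 5          |
| 9               | [9]     | 3          | 64      | 49         | 6       | 8          | 1       | 4          |
| 10              | 1       | [3]        | 65      | 38         | 7       | [6]        | 0       | 3          |
| 8-10 (subtotal) | 13      | 10         |         |            | 32      | 24         | 12      | 12         |
| 11 or more      | 23      | 30         | 419     | 335        | 48      | 52         | 22      | 25         |

|                    | Chicago Math. Soc. | Chem. Abstracts | Physicists | Econometrica |
|--------------------|--------------------|-----------------|------------|--------------|
| Estimated $\alpha$ | 0§                 | 0.30            | 0.39       | 0.41         |
| Estimated $\rho$   | 0.916§             | 1.43            | 1.64       | 1.69         |
| $k$                | 1,124              | 22,939          | 3,396      | 1,759        |
| $n_k$              | 278                | 6,891           | 1,325      | 721          |
| $\rho'$            | 1.07               |                 |            |              |

† Davis (1941).
‡ Leavens (1953).
§ $\rho$ estimated in this case from (2-31) to (2-32).
Entries in square brackets are illegible in the source and are inferred from the 8-10 subtotals and the column totals $n_k$.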
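
As a consistency check on the parameter rows of Table 3, the estimates of $\alpha$ and $\rho$ can be recomputed from $k$ and $n_k$. The sketch below assumes $\alpha = n_k/k$ and the relation $\rho = 1/(1-\alpha)$; the latter is not stated in this part of the text and is used here as an assumption that happens to reproduce the tabulated values. The Chicago column is omitted because its $\rho$ was obtained differently (footnote §).

```python
# (k, n_k) pairs transcribed from Table 3 for the three counts fitted with (2-11) and (2-21).
TABLE3 = {
    "Chem. Abstracts": (22_939, 6_891),
    "Physicists": (3_396, 1_325),
    "Econometrica": (1_759, 721),
}


def estimate_parameters(k, n_k):
    """Return (alpha, rho) with alpha = n_k / k and the assumed rho = 1 / (1 - alpha)."""
    alpha = n_k / k
    rho = 1.0 / (1.0 - alpha)
    return alpha, rho


for name, (k, n_k) in TABLE3.items():
    alpha, rho = estimate_parameters(k, n_k)
    print(f"{name}: alpha = {alpha:.2f}, rho = {rho:.2f}")
```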


C. City sizes 
It has been observed, for every U.S. Census since the early nineteenth century, and for
most other Western countries as well, that if $F(i)$ is the number of cities of population
greater than $i$, then

$$F(i) \sim A i^{-\rho}, \tag{4-3}$$

where $\rho$ is close to 1 (see Zipf, 1949, chs. 9, 10).

Again, we would expect such a distribution if the underlying mechanism were one
describable by equations like (2-3) and (2-4). Such a mechanism is not hard to conceive.
First, equation (2-3) would hold if the growth of population were due solely to the net 
excess of births over deaths, and if this net growth were proportional to present population. 
This assumption is certainly satisfied at least roughly. Moreover, it need not hold for each 
city, but only for the aggregate of cities in each population band. Finally, the equation 
would still be satisfied if there were net migration to or from cities of particular regions, 
provided the net addition or loss of population of individual cities within any region was 
proportional to city size. That is, even if all California cities were growing, and all New 
England cities declining, the equation would hold provided the percentage growth or 
decline in each area were uncorrelated with city size. 
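
A power law of this kind is usually checked by fitting a straight line to $\log F(i)$ against $\log i$. The sketch below is one simple way of doing so, assuming ordinary least squares on the logarithms; the city sizes and counts are synthetic values generated exactly from $F(i) = A/i$, not census figures.

```python
import math


def fit_power_law(sizes, counts):
    """Least-squares fit of log F = log A - rho * log i; returns (A, rho)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return math.exp(mean_y - slope * mean_x), -slope


# Synthetic data: number of cities larger than each size, generated with rho = 1.
sizes = [10_000, 25_000, 50_000, 100_000, 250_000, 1_000_000]
counts = [5e6 / s for s in sizes]
A, rho = fit_power_law(sizes, counts)
print(A, rho)
```

With real census counts the points would scatter about the line, and the fit would only be meaningful above the minimum city size.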
















In the case of cities, equation (4-3) could only be expected to hold down to some minimum
city size, say 5000 or 10,000. The constant $\alpha$ would then be interpreted as the fraction of
the total population growth in cities above the minimum size that is accounted for by the
new cities that reach that size.


D. Income distribution 


Vilfredo Pareto is generally credited with the discovery that if personal incomes are
ranked by size, the number of persons, $F(i)$, whose incomes exceed $i$ can be approximated
closely, for the upper ranges of income, by equation (4-3) with $\rho$ usually in the neighbourhood
of 1.5 (Davis, 1941; Champernowne, 1953). Hence, the income distributions bear a family
resemblance in their upper ranges to those we have already considered, although the
parameter, $\rho$, is substantially larger than 1, its characteristic value in the case of word
frequencies and city size distributions.

A stochastic mechanism similar to those described in § III would again produce steady-state
distributions closely resembling the observed ones. We picture the stream of income
as a sequence of dollars allocated probabilistically to the recipients. If the total annual
income of all persons above some specified minimum income is $k$ dollars, the segment of
this sequence running from the $m$th to the $(m+k)$th dollar is the income for the year beginning
at time $m$. We assume that the probability that the next dollar will be allotted to some
person with an annual income of $i$ dollars is proportional to $(i+c)f(i)$, with $c$ positive but
small. This represents a modification of assumption (I) that decreases the proportion of the
total stream going to persons of high income relative to the proportion going to persons with
incomes close to the minimum. We assume that a fraction of the dollars is assigned to new
persons, i.e. persons reaching the minimum income to which the assumptions apply
(assumption (II)). We assume that there is considerable variance among persons within
each income class in the probability of receiving additional income, so that the rate at which
dollars are dropped from any income class as $m$ increases satisfies assumption (III). Then
we obtain again equation (3-8), which now holds for $i$ greater than the minimum income.
For large $i$, this distribution has the required properties with 1/A = $\rho$.

The same result has been reached by D. G. Champernowne (1953), following a somewhat
different route. He divides income recipients at time $t_0$ into classes of equal proportionate
width. That is, if $i_m$ is the minimum income considered, then the first class contains persons
with incomes between $i_m$ and $ri_m$, the second class, persons with incomes between $ri_m$ and
$r^2i_m$, and so on. Next he introduces transition probabilities $p_{gh}$ that a person who is in class
$g$ at time $t_0$ will be in class $h$ at time $t_1$. He assumes that $p_{gh}$ is a function only of $(g-h)$.
Now, by his definition of the income classes, the average income of persons in class $g$ will
be about $r^{g-h}$ times the average income of persons in class $h$. Hence, the expected income
at $t_1$ of a person who was in class $g$ at $t_0$ will be

$$\sum_h p_{gh}\,\bar{i}_h = \sum_h p_{g-h}\,r^{h-g}\,\bar{i}_g = \alpha\,\bar{i}_g \quad (\alpha \text{ a constant}), \tag{4-4}$$

where $\bar{i}_g$ is the average income in class $g$. Prof. Champernowne assumes explicitly that $\alpha < 1$.
From this it is clear that his model satisfies our assumptions (I) (in its original form) and (II).
Further, since he assumes a substantial variance in income expectations among persons in
a given class, our assumption (III) is also approximately satisfied. Hence, in spite of the
surface differences between his model and those developed here, the underlying structure
is the same.
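
The constant $\alpha$ in (4-4) depends only on the transition probabilities and on the class width $r$, not on the starting class $g$. The following sketch evaluates it for a made-up transition kernel; the kernel, the value of $r$ and the function name are illustrative assumptions, not Champernowne's figures.

```python
def expected_income_factor(p, r):
    """alpha = sum over d = g - h of p[d] * r**(-d), i.e. the sum of p_{g-h} r^{h-g}.

    `p` maps the class difference d = g - h to its probability; d = 1 means
    moving down one class and d = -1 means moving up one class.
    """
    return sum(prob * r ** (-d) for d, prob in p.items())


# Hypothetical kernel: 30% move down one class, 50% stay, 20% move up one class.
p = {1: 0.3, 0: 0.5, -1: 0.2}
r = 1.2  # each class is 20% wider (in income) than the one below it

alpha = expected_income_factor(p, r)
print(alpha)
```

In this example the downward drift slightly outweighs the upward one, so the expected income next period is a little below the current class average, which is the case $\alpha < 1$ assumed in the text.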







E. Biological species 

We conclude this very incomplete list of phenomena exhibiting the Yule distribution by
mentioning the example originally analysed by Yule himself (1924). It was discovered by
Willis that the number, $f(i)$, of genera of plants having $i$ species each was distributed
approximately according to (4-3), with $\rho < 1$. Yule explained these data by a probability model in
which the probability, $s$, of a specific mutation occurring in a particular genus during a
short time interval was proportional to the number of species in the genus; while the
probability, $r$, of a generic mutation during the same interval was proportional to the
number of genera. Starting at $t_0$ with a single genus of one species, he computed the
distribution $f(i,t)$ for $t_1, t_2, \ldots$, and found the limit as $t \to \infty$. This limiting distribution
corresponds to (2-13) with $\rho = r/s$. Yule observed that for $r < s$ (as required to fit the empirical
data), this was not a proper distribution function, and obtained the approximate
distribution for $t = T$. His procedure was equivalent to replacing the complete Beta function
in (2-13) by the incomplete Beta function, taking as the upper limit of integration an
appropriate function of $T$.

If, in the process of § II, we define $k$ as the total number of different species and $f(i,k)$ as
the number of genera with exactly $i$ species, we see that our $k$ is a monotonic increasing
function of Yule’s $t$ (specifically, $k = e^{st}$). Making the appropriate transformation of
variables, we find that Yule’s assumption with respect to the rate of specific mutation
corresponds to our assumption (I’) (and hence is considerably stronger than the assumption we
employed in § II). Making the same transformation of variables with respect to his
assumption of a constant rate of generic mutation, we find that $n_k = e^{rt}$. We can then compute
$\alpha(k)$ (which will now vary with $k$) by taking the derivative of $n_k$ with respect to $k$. We obtain

$$\alpha(k) = r e^{(r-s)t}/s. \tag{4-5}$$

If we substitute these values in equation (2-34) of case II, where we assumed slowly
changing $\alpha$, we find in the limit, as $t \to \infty$, $\rho = r/s$, as required. Hence, we see that the
process of § II is essentially the same as the one treated by Yule.
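
The derivative that gives $\alpha(k)$ is a chain-rule computation. The lines below merely spell it out under the two identifications $k = e^{st}$ and $n_k = e^{rt}$ used above; the final form in terms of $k$ is an added restatement, not a formula from the text.

```latex
\begin{align*}
\alpha(k) &= \frac{dn_k}{dk} = \frac{dn_k/dt}{dk/dt}
           = \frac{r e^{rt}}{s e^{st}} = \frac{r}{s}\, e^{(r-s)t}, \\
          &= \frac{r}{s}\, k^{(r-s)/s} \qquad \text{since } e^{t} = k^{1/s}.
\end{align*}
```

For $r < s$ the exponent is negative, so $\alpha(k)$ decreases as $k$ grows, and its proportional change per added species becomes small for large $k$.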

It is interesting and a little surprising that when Yule, some twenty years after this dis- 
covery, examined the statistics of vocabulary, he did not employ this model to account for 
the observed distributions of word frequencies. Indeed, in his fascinating book on The 
Statistical Study of Literary Vocabulary (1944) he nowhere refers to his earlier paper on 
biological distributions. 


V. CONCLUSION 


This paper discusses a number of related stochastic processes that lead to a class of highly 
skewed distributions (the Yule distribution) possessing characteristic properties that 
distinguish them from such well-known functions as the negative binomial and Fisher's 
logarithmic series. In § I, the distinctive properties of the Yule distribution were described.
In §§ II and III several stochastic processes were examined from which this distribution can 
be derived. In §IV, a number of empirical distributions that can be approximated closely 
by the Yule distribution were discussed, and mechanisms postulated to explain why they 
are determined by this particular kind of stochastic process. In the same section, the 
derivations of §§ II and III were compared with models previously proposed by Yule (1924) 
and Champernowne (1953) to account for the data on biological species and on incomes, 
respectively. 




The probability assumptions we need for the derivations are relatively weak, and of the 
same order of generality as those commonly employed in deriving other distribution 
functions—the normal, Poisson, geometric and negative binomial. Hence, the frequency 
with which the Yule distribution occurs in nature—particularly in social phenomena— 
should occasion no great surprise. This does not imply that all occurrences of this empirical 
distribution are to be explained by the process discussed here. To the extent that other 
mechanisms can be shown also to lead to the same distribution, its common occurrence is 
the less surprising. Conversely, the mere fact that particular data conform to the Yule 
distribution and can be given a plausible interpretation in terms of the stochastic model 
proposed here tells little about the underlying phenomena beyond what is contained in 
assumptions (I) through (III). 


REFERENCES 


CHAMPERNOWNE, D. G. (1953). A model of income distribution. Econ. J. 63, 318.

DARWIN, J. H. (1953). Population differences between species growing according to simple birth and
death processes. Biometrika, 40, 370.

DAVIS, Harold T. (1941). The Analysis of Economic Time Series. Principia Press.

FELLER, William (1950). An Introduction to Probability Theory and Its Applications, vol. 1. Wiley.

GOOD, I. J. (1953). The population frequencies of species and the estimation of population parameters.
Biometrika, 40, 237.

HANLEY, Miles L. (1937). Word Index to James Joyce’s Ulysses. University of Wisconsin Press.

KENDALL, David G. (1948). On some modes of population growth leading to R. A. Fisher’s logarithmic
series distribution. Biometrika, 35, 6.

LEAVENS, Dickson H. (1953). Econometrica, 21, 630.

MANDELBROT, Benoit (1953). An informational theory of the statistical structure of language. In
Communication Theory (ed. by Willis Jackson). Butterworths.

THORNDIKE, Edward L. (1937). On the number of words of any given frequency of use. Psychol.
Rec. 1, 399.

TITCHMARSH, E. C. (1939). The Theory of Functions, 2nd ed. Oxford University Press.

YULE, G. Udny (1924). A mathematical theory of evolution, based on the conclusions of Dr J. C.
Willis, F.R.S. Phil. Trans. B, 213, 21.

YULE, G. Udny (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.

ZIPF, George Kingsley (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley
Press.













SIMULTANEOUS TESTS OF LINEAR HYPOTHESES 


By M. N. GHOSH 
University of North Carolina and University of Calcutta 


1. INTRODUCTION 


A very common situation in the analysis of variance of survey data, where the investigator 
is not able to put his experimental data in the framework of a design, planned in advance, 
is that the estimates of the parameters are correlated in various ways and the analysis of 
variance becomes cumbersome. Even for the analysis of variance of a two-way classification 
with unequal class frequencies one meets with this difficulty. We shall be concerned here 
with the analysis of such data and the test of significance for groups of parameters repre- 
senting different aspects of the problem. We consider below a concrete example, in which 
standing height, sexual maturity characters and certain blood chemicals, like haemoglobin,
ascorbic acid, carotene and alkaline phosphatase were measured for a number of girls between
the ages 9 and 14 years.* For the purpose of analysis the girls were divided into twenty-five
classes according to sexual maturity ratings from indices of breast ($B_i$) and pubic hair
development ($Ph_j$), each of these being a 5-point rating.
The following model is assumed:

$$y_{k(i,j)} = \mu_{ij} + \beta_1 x_{1,k} + \beta_2 x_{2,k} + \beta_3 x_{3,k} + \gamma z_k + \epsilon_k, \tag{1-1}$$

where $(y_k, x_{1,k}, x_{2,k}, x_{3,k}, z_k)$ $(k = 1, 2, \ldots, N)$ are sample individuals. $y_{k(i,j)}$ represents the
height increment of the $k$th girl in the maturity class ($B_i$ $Ph_j$); $x_{1,k}$, $x_{2,k}$, $x_{3,k}$ are the nutritional
variables haemoglobin, ascorbic acid and carotene respectively, and $z_k$ is alkaline phosphatase.
We thus consider three groups of parameters $\{\mu_{ij}\}$ $(i,j = 1, \ldots, 5)$, $\{\beta_1, \beta_2, \beta_3\}$ and
$\{\gamma\}$. The hypotheses corresponding to these three groups of parameters are

$$\begin{aligned}
H_1&: \mu_{ij} = \mu \quad (i,j = 1, \ldots, 5),\\
H_2&: \beta_1 = 0,\ \beta_2 = 0,\ \beta_3 = 0,\\
H_3&: \gamma = 0.
\end{aligned} \tag{1-2}$$

Rejection of the hypothesis $H_1$ would imply that the height increment depends upon sexual
maturity, the rejection of $H_2$ would imply that the blood chemicals, which depend upon
nutritional status, affect height increment, and the rejection of $H_3$ would imply that
alkaline phosphatase affects height increment.

The hypotheses $H_1$, $H_2$, $H_3$ may be tested by the analysis of variance method, but these
tests are not independent, because of the non-orthogonality of the groups of estimates as
well as the fact that we have to use the same estimate of error variance. We shall introduce
the notion of quasi-independent tests of multiple hypotheses, which is appropriate in
such situations.


* The data were collected by Dr Hughes Bryan, Professor of Nutrition, School of Public Health, 
University of North Carolina. 




2. QUASI-INDEPENDENT TESTS OF MULTIPLE HYPOTHESES


For any test of significance we consider the first and second kinds of error. Let the
hypotheses tested be

$$H_1: \theta_1 = 0, \quad H_2: \theta_2 = 0, \quad H_3: \theta_3 = 0. \tag{2-1}$$

The tests $T_1$, $T_2$, $T_3$ of $H_1$, $H_2$, $H_3$ will be called quasi-independent when

$$\begin{aligned}
\mathrm{Prob}_{T_1}\{\text{acceptance } H_1 \mid \theta_1 \neq 0, \theta_2, \theta_3\} &= \mathrm{Prob}_{T_1}\{\text{acceptance } H_1 \mid \theta_1 \neq 0, \theta_2 = 0, \theta_3 = 0\},\\
\mathrm{Prob}_{T_1}\{\text{rejection } H_1 \mid \theta_1 = 0, \theta_2, \theta_3\} &= \mathrm{Prob}_{T_1}\{\text{rejection } H_1 \mid \theta_1 = 0, \theta_2 = 0, \theta_3 = 0\},
\end{aligned} \tag{2-2}$$

hold for all values of $\theta_2$ and $\theta_3$, i.e. whether $H_2$ and $H_3$ are true or not; and similar equations
hold for $T_2$ and $T_3$, where $\mathrm{Prob}_{T_1}\{\ \}$ denotes the probability of the statement in braces
for the test procedure $T_1$, etc. Thus for quasi-independent tests of hypotheses $H_1$, $H_2$ and $H_3$
the first and second kinds of error for each hypothesis do not depend on the parameters of
the other hypotheses.

Two tests of hypotheses $H_1$ and $H_2$ with the critical regions $C_1$ and $C_2$ are independent if
the following holds for all $\theta_1$ and $\theta_2$:

$$\mathrm{Prob}\{X \subset C_1, X \subset C_2 \mid \theta_1, \theta_2\} = \mathrm{Prob}\{X \subset C_1 \mid \theta_1\}\,\mathrm{Prob}\{X \subset C_2 \mid \theta_2\},$$

$X$ being the observed sample.
It will be seen later that the usual analysis of variance tests for multiple hypotheses are
quasi-independent, and we most often do not need independent tests.


3. CONTROL OF ERRORS IN THE SIMULTANEOUS TESTS OF HYPOTHESES 


There may be different points of view for assigning significance levels, in the case of
simultaneous tests of hypotheses. In certain situations, where the decisions regarding the
hypotheses $H_1, H_2, \ldots$, etc., are unrelated, it is proper to consider the significance level of each
hypothesis individually at the 5 or 1% level (say). But when the decisions have a joint import,
it is proper to consider the first kind of error of the simultaneous test of $H_1$ and $H_2$ as the
rejection of at least one of the hypotheses, when all of them are in fact true.

The significance level of the simultaneous test is defined as the probability of rejecting at least 
one of the hypotheses, when all of them are true. 

Suppose in the test of a new variety of crop against a common variety used as a control 
we are interested in two different characters and the new variety may be considered desir- 
able when it is superior in either of these characters. Since one would be interested to know 
which particular character in the new variety is superior to the control, a simultaneous test 
should be done. In this case, the significance level, as some sort of measure of the amount 
of caution implied in the test, should naturally be that of the simultaneous test, since one 
would like to insure, at a certain level, against declaring the new variety superior, when it 
is actually no better than the control. 

A similar situation exists in quality control for acceptance of material, when a number 
of characters are examined and the material is rejected when it is not up to the mark with 
respect to any of these characters. It may be useful, in this case, to determine the particular 
deficiency in any of these characters, and so we must use a simultaneous test, and the 
significance level of the simultaneous test should be used to insure against too frequent 
rejections. Of course, we pay for this by widening confidence intervals for the parameters 
measuring these characters. 
















The need for a safety device like the simultaneous significance level would be even more 
apparent for the agricultural or quality control example stated above, when the number of 
characters is large, in which case the chance of declaring a new variety different from the 
common variety may be much larger than 5%. However, the control of the first kind of 
error in a simultaneous test is achieved only at the expense of increasing the second kind 
of error for individual characters. In any particular problem, whether the point of view of 
individual or simultaneous significance level should be adopted depends, roughly, on how
much a priori weight one attaches to the alternative hypotheses, either from theoretical
expectations or from considerations of cost in replacing the old variety by the new variety. 
If this cost is small, e.g. if the varieties are grown only on an experimental scale, one would 
be relatively free to decide on either of the varieties as superior and the individual signi- 
ficance level would be the proper one. However, with varieties established in agricultural 
practice, any statistical decision in favour of a new variety would involve large expenditures 
and the statistician should take an attitude of caution. The simultaneous significance level 
takes account of this attitude and is thus appropriate in such cases. 
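
How quickly the chance of at least one false rejection grows with the number of characters can be illustrated with a short calculation. The sketch below assumes, purely for illustration, that the individual tests are independent; the tests discussed in this paper share an error variance and are therefore not independent, so the numbers indicate orders of magnitude only. The function name is hypothetical.

```python
def familywise_level(individual_level, num_tests):
    """Probability of rejecting at least one of `num_tests` true hypotheses,
    assuming independent tests each carried out at `individual_level`."""
    return 1.0 - (1.0 - individual_level) ** num_tests


for m in (1, 2, 5, 10, 20):
    print(m, round(familywise_level(0.05, m), 3))
```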

In the analysis of variance problem considered before, the conventional statistical
procedure of (a) testing for the overall SS for fitting constants, (b) testing separately for the
SS of $H_1$, adjusting for $H_2$, at the 5% level, etc., would give quasi-independent tests at individual
levels of significance 5%. But the problem of obtaining the simultaneous significance level
in this case is mathematically a very intractable one, and we shall find an upper bound for 
the significance level. This would give a control of the first kind of error. The operating 
characteristic would, of course, be of the same nature as for the usual analysis of variance 
of multiple hypotheses, since the test procedure is essentially the same, except that the 
exact significance level is not known but only its upper bound. 

The notion of simultaneous level of significance has already been considered in various 
ways and languages by Scheffé (1953), Tukey (1952) and Nandi (1951), and the practical 
implications have been thoroughly discussed by Tukey in a mimeographed report. 


4. SIMULTANEOUS TESTS AND SIGNIFICANCE LEVELS


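Consider the linear model

$$E(y_i) = a_{i1}p_1 + \cdots + a_{im}p_m \quad (i = 1, 2, \ldots, n), \tag{4-1}$$

$y_i$ being independent normal variables with unknown variance $\sigma^2$, and $p_1, \ldots, p_m$ being
unknown parameters. Let rank $(a_{ij}) = N_0$ and let $\pi_1, \ldots, \pi_R$ be estimable linear functions of the
parameters $p_1, \ldots, p_m$,

$$\pi_k = l_{k1}p_1 + l_{k2}p_2 + \cdots + l_{km}p_m \quad (k = 1, 2, \ldots, R), \tag{4-2}$$

such that the coefficient vectors $(l_{k1}, \ldots, l_{km})$ form a vector space of rank $R < N_0$. We consider
the following linear hypotheses,

$$\begin{aligned}
H_1&: \pi_1 = 0, \ldots, \pi_{\kappa_1} = 0,\\
H_2&: \pi_{\kappa_1+1} = 0, \ldots, \pi_{\kappa_1+\kappa_2} = 0,\\
&\ \ \vdots\\
H_s&: \pi_{\kappa_1+\cdots+\kappa_{s-1}+1} = 0, \ldots, \pi_{\kappa_1+\cdots+\kappa_s} = 0.
\end{aligned} \tag{4-3}$$

Let $Y_1, \ldots, Y_{\kappa_1}$; $Y_{\kappa_1+1}, \ldots, Y_{\kappa_1+\kappa_2}$; $\ldots$; $Y_{\kappa_1+\cdots+\kappa_s}$ be the best linear estimates of these parameters,
obtained by the method of least squares. We shall sometimes denote the coefficient vectors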












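of these linear functions by the same symbol, so that we have the alternative notation for
the linear function $Y_i = (Y_i, y)$, where $(Y_i, y)$ is the scalar product of the vector $Y_i$ and the
vector $y$. From Markoff’s theorem we have an independent estimate of error variance,
$S^2$, with $n_e$ d.f., which is independent of the parameters $p_1, \ldots, p_m$. Quasi-independent tests
of the hypotheses $H_n$ can be made by considering the linear functions

$$(Y_{b_{n-1}+1}, y), \ldots, (Y_{b_n}, y) \quad (b_n = \kappa_1 + \cdots + \kappa_n),$$

whose expectations are $\pi_{b_{n-1}+1}, \ldots, \pi_{b_n}$. Let $U_{b_{n-1}+1}, \ldots, U_{b_n}$ be orthonormal vectors forming
a basis of the vector space formed by $Y_{b_{n-1}+1}, \ldots, Y_{b_n}$. Then $(U_i, y)$ is a linear form in
$(Y_j, y)$ $(i, j = b_{n-1}+1, \ldots, b_n)$, with coefficients $\lambda_{ij}$ say, and

$$E\{(U_i, y)\} = \sum_{j=b_{n-1}+1}^{b_n} \lambda_{ij}\,E(Y_j, y) = \sum_{j=b_{n-1}+1}^{b_n} \lambda_{ij}\,\pi_j, \quad \text{from } E(Y_j, y) = \pi_j. \tag{4-4}$$

On the hypothesis $H_n$,

$$\chi_n^2/\sigma^2 = \sum_{i=b_{n-1}+1}^{b_n} [(U_i, y) - E(U_i, y)]^2/\sigma^2$$

has a $\chi^2$-distribution with $\kappa_n$ d.f., and $\chi_n^2/(\kappa_n S^2)$ has an $F$-distribution with d.f. $(\kappa_n, n_e)$. The second kind of error for the test
of $H_n$ depends only upon the parameters $\pi_{b_{n-1}+1}, \ldots, \pi_{b_n}$.

We now consider sets $M_1, M_2, \ldots$ of vectors $Y_1, \ldots, Y_{\kappa_1}$; $Y_{\kappa_1+1}, \ldots, Y_{\kappa_1+\kappa_2}$; $\ldots$ belonging to
hypotheses $H_1, \ldots, H_s$, so that all vectors belonging to the same hypothesis $H_i$ belong to the
same set $M_j$. Vectors belonging to different sets are orthogonal, and if the vectors belonging
to two hypotheses $H_i$ and $H_j$ are orthogonal they belong to different sets. These sets we shall
call orthogonal sets and hypotheses belonging to two different sets, orthogonal hypotheses.
We shall consider different cases according as an orthogonal set consists of a single
hypothesis $H_i$ or more.

Case I. Let all hypotheses $H_1, \ldots, H_s$ be orthogonal so that each orthogonal set $M_i$ consists
of a single hypothesis. Let $U_1, \ldots, U_{\kappa_1}$; $U_{\kappa_1+1}, \ldots, U_{\kappa_1+\kappa_2}$; $\ldots$; $U_{\kappa_1+\cdots+\kappa_s}$ be orthonormal systems
of vectors in the spaces determined by $Y_1, \ldots, Y_{\kappa_1}$; $Y_{\kappa_1+1}, \ldots, Y_{\kappa_1+\kappa_2}$; $\ldots$; $Y_{\kappa_1+\cdots+\kappa_s}$ respectively.
Let

$$E\{(U_i, y)\} = \Phi_i.$$

We shall consider for simplicity $s = 3$. The joint distribution of

$$\chi_1^2 = \sum_{i=1}^{\kappa_1} \{(U_i, y) - \Phi_i\}^2, \quad \chi_2^2 = \sum_{i=\kappa_1+1}^{\kappa_1+\kappa_2} \{(U_i, y) - \Phi_i\}^2, \quad \chi_3^2 = \sum_{i=\kappa_1+\kappa_2+1}^{\kappa_1+\kappa_2+\kappa_3} \{(U_i, y) - \Phi_i\}^2, \tag{4-5}$$

and $S^2$ is given by

$$\text{const.}\,\exp\left\{-\frac{1}{2\sigma^2}\left[\chi_1^2 + \chi_2^2 + \chi_3^2 + n_e S^2\right]\right\} (\chi_1^2)^{\frac12(\kappa_1-2)} (\chi_2^2)^{\frac12(\kappa_2-2)} (\chi_3^2)^{\frac12(\kappa_3-2)} (S^2)^{\frac12(n_e-2)}\, d\chi_1^2\, d\chi_2^2\, d\chi_3^2\, dS^2.$$

Put

$$\frac{\chi_1^2}{n_e S^2} = G_1, \quad \frac{\chi_2^2}{n_e S^2} = G_2, \quad \frac{\chi_3^2}{n_e S^2} = G_3.$$

Making the transformation, integration for $S^2$ gives the distribution of $G_1$, $G_2$, $G_3$ as

$$C(\kappa_1, \kappa_2, \kappa_3; n_e)\, \frac{G_1^{\frac12(\kappa_1-2)}\, G_2^{\frac12(\kappa_2-2)}\, G_3^{\frac12(\kappa_3-2)}}{(1 + G_1 + G_2 + G_3)^{\frac12(\kappa_1+\kappa_2+\kappa_3+n_e)}}\, dG_1\, dG_2\, dG_3. \tag{4-6}$$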


















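We now consider the region defined by $G_1 < A_1$, $G_2 < A_2$, $G_3 < A_3$, which has the probability
$P(A_1, A_2, A_3)$, say. If $P(A_1, A_2, A_3) = 1-\alpha$, we have the system of simultaneous confidence
regions $C_1, C_2, \ldots$ for the sets of parameters with the confidence coefficient $1-\alpha$, as

$$\begin{aligned}
C_1&: \pi_1, \ldots, \pi_{\kappa_1}: & \sum_{i=1}^{\kappa_1} \{(U_i, y) - \Phi_i\}^2 &< A_1 n_e S^2;\\
C_2&: \pi_{\kappa_1+1}, \ldots, \pi_{\kappa_1+\kappa_2}: & \sum_{i=\kappa_1+1}^{\kappa_1+\kappa_2} \{(U_i, y) - \Phi_i\}^2 &< A_2 n_e S^2;\\
C_3&: \pi_{\kappa_1+\kappa_2+1}, \ldots, \pi_{\kappa_1+\kappa_2+\kappa_3}: & \sum_{i=\kappa_1+\kappa_2+1}^{\kappa_1+\kappa_2+\kappa_3} \{(U_i, y) - \Phi_i\}^2 &< A_3 n_e S^2.
\end{aligned} \tag{4-7}$$

The best choice of $A_1$, $A_2$, $A_3$, i.e. to give the smallest confidence regions, is not known and
needs further investigation. From considerations of degrees of freedom we may consider
$A_1/\kappa_1 = A_2/\kappa_2 = A_3/\kappa_3 = A$.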


5. EVALUATION OF SIGNIFICANCE LEVEL 
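To determine $A$ from (4-6) we have

$$C(\kappa_1, \kappa_2, \kappa_3; n_e) \int_0^{\kappa_1 A} dG_1 \int_0^{\kappa_2 A} dG_2 \int_0^{\kappa_3 A} \frac{G_1^{\frac12(\kappa_1-2)}\, G_2^{\frac12(\kappa_2-2)}\, G_3^{\frac12(\kappa_3-2)}}{(1 + G_1 + G_2 + G_3)^{\frac12(\kappa_1+\kappa_2+\kappa_3+n_e)}}\, dG_3 = 1 - \alpha, \tag{5-1}$$

where $\alpha$ is the significance level. Put $t_1 = G_1$, $t_2 = G_2/(1+G_1)$, $t_3 = G_3/(1+G_1+G_2)$, then the
above becomes

$$C(\kappa_1, \kappa_2, \kappa_3; n_e) \int_0^{\kappa_1 A} \frac{t_1^{\frac12(\kappa_1-2)}\, dt_1}{(1+t_1)^{\frac12(\kappa_1+n_e)}} \int_0^{\kappa_2 A/(1+t_1)} \frac{t_2^{\frac12(\kappa_2-2)}\, dt_2}{(1+t_2)^{\frac12(\kappa_1+\kappa_2+n_e)}} \int_0^{\kappa_3 A/\{(1+t_1)(1+t_2)\}} \frac{t_3^{\frac12(\kappa_3-2)}\, dt_3}{(1+t_3)^{\frac12(\kappa_1+\kappa_2+\kappa_3+n_e)}}, \tag{5-2}$$

which can be evaluated by successive numerical integration and using Pearson’s Tables of
Incomplete B-functions.

For any confidence coefficient $1-\alpha$, the calculation of $A$ depends upon calculations of
iterated integrals, and tables have to be prepared for these. One difficulty of preparing such
tables is that the integrals depend upon the parameters $\kappa_1$, $\kappa_2$, $\kappa_3$; $n_e$, etc., and thus unless
the number of parameters can be reduced, construction of tables would be difficult. We may,
however, get an upper bound for the significance level $\alpha$, from an inequality of Kimball
(1951) given below:

$$\Pr\{|G_1| < A_1, |G_2| < A_2, |G_3| < A_3\} \geq \Pr\{|G_1| < A_1\}\,\Pr\{|G_2| < A_2\}\,\Pr\{|G_3| < A_3\}. \tag{5-3}$$

Here we choose $A_1$, $A_2$, $A_3$ so that

$$\Pr\{|G_1| < A_1\} = \Pr\{|G_2| < A_2\} = \Pr\{|G_3| < A_3\}. \tag{5-4}$$

Thus if $\Pr\{|G_1| < A_1, |G_2| < A_2, |G_3| < A_3\}$ is to be at least 0.95 we have to make $\Pr\{|G_1| < A_1\} = 0.983$.
Now $(n_e/\kappa_1)G_1$ has an $F$-distribution with $\kappa_1$ and $n_e$ d.f., so that $A_1 = (\kappa_1/n_e)F_{\alpha'}(\kappa_1, n_e)$, where $F_{\alpha'}(\kappa_1, n_e)$
is the upper $\alpha'$ point of the $F$-distribution with d.f. $(\kappa_1, n_e)$. We thus have

$$\Pr\left\{|G_1| < \frac{\kappa_1}{n_e}F_{\alpha'}(\kappa_1, n_e),\ |G_2| < \frac{\kappa_2}{n_e}F_{\alpha'}(\kappa_2, n_e),\ |G_3| < \frac{\kappa_3}{n_e}F_{\alpha'}(\kappa_3, n_e)\right\} \geq (1-\alpha')^3 = 1-\alpha, \text{ say.}$$

Thus

$$\alpha' = 1 - (1-\alpha)^{1/3}. \tag{5-5}$$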
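
The individual level implied by a given simultaneous level follows directly from the product bound. The sketch below is an illustration only; the function name and the default of three hypotheses are assumptions made to match the case $s = 3$ treated above.

```python
def individual_level(simultaneous_level, num_hypotheses=3):
    """Level alpha' for each separate test such that the product bound gives
    (1 - alpha') ** num_hypotheses = 1 - simultaneous_level."""
    return 1.0 - (1.0 - simultaneous_level) ** (1.0 / num_hypotheses)


alpha_prime = individual_level(0.05)
# Each of the three separate statements must hold with probability 1 - alpha'.
print(round(1.0 - alpha_prime, 3), round(alpha_prime, 4))
```

With $\alpha'$ in hand, each bound $A_i$ would be read from an $F$ table at the upper $\alpha'$ point with $(\kappa_i, n_e)$ d.f. (or computed, for instance, with `scipy.stats.f.ppf(1 - alpha_prime, kappa_i, n_e)`) and multiplied by $\kappa_i/n_e$.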













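Case II. We now consider a set $M_i$ which consists of more than one group of linear
functions, say it contains linear functions corresponding to the hypotheses $H_1$ and $H_2$. We shall
call this a compound set. In this case the linear functions belonging to $H_1$ and $H_2$ are
non-orthogonal. We shall show that there is no non-singular linear transformation by which
these linear functions can be transformed into mutually orthogonal sets corresponding to
hypotheses $H_1$ and $H_2$ respectively. Thus the basic inequality (5-3) is not directly applicable.

Let $U_1, \ldots, U_{\kappa_1}$; $U_{\kappa_1+1}, \ldots, U_{\kappa_1+\kappa_2}$ be an orthonormal basis of the vector space formed by
$Y_1, \ldots, Y_{\kappa_1}$; $Y_{\kappa_1+1}, \ldots, Y_{\kappa_1+\kappa_2}$, so that $U_1, \ldots, U_{\kappa_1}$ is a basis of the vector space formed by $Y_1, \ldots, Y_{\kappa_1}$.
As before let $E\{(U_i, y)\} = \Phi_i$, then $\Phi_i$ are linear functions of $\pi_1, \ldots, \pi_{\kappa_1}$; $\pi_{\kappa_1+1}, \ldots, \pi_{\kappa_1+\kappa_2}$.

Let $\bar{Y}_i$ be the normalized vector corresponding to $Y_i$, i.e. $\bar{Y}_i = Y_i/|Y_i|$, where $|Y_i|$ is the norm of
the vector $Y_i$, and let $\bar{\pi}_i = \pi_i/|Y_i|$. The relation between $U$-vectors and $Y$-vectors is expressed
by

$$\begin{pmatrix} \bar{Y}_{I} \\ \bar{Y}_{II} \end{pmatrix} = \begin{pmatrix} \alpha & 0 \\ \beta & \gamma \end{pmatrix} \begin{pmatrix} U_{I} \\ U_{II} \end{pmatrix}, \tag{5-6}$$

where $\alpha = (\alpha_{ij})$ is a $(\kappa_1 \times \kappa_1)$ matrix, $\gamma = (\gamma_{ij})$ is a $(\kappa_2 \times \kappa_2)$ matrix, $\beta = (\beta_{ij})$ is a $(\kappa_2 \times \kappa_1)$
matrix, $\bar{Y}_{I}$ is a $(1 \times \kappa_1)$ matrix with components $\bar{Y}_1, \ldots, \bar{Y}_{\kappa_1}$, $\bar{Y}_{II}$ is a $(1 \times \kappa_2)$ matrix with
components $\bar{Y}_{\kappa_1+1}, \ldots, \bar{Y}_{\kappa_1+\kappa_2}$ and similarly $U_{I}$ is a $(1 \times \kappa_1)$ matrix with components $U_1, \ldots, U_{\kappa_1}$, etc.
From the choice of the basis $U_1, \ldots, U_{\kappa_1}$; $U_{\kappa_1+1}, \ldots, U_{\kappa_1+\kappa_2}$, it is clear that the matrices $\alpha$ and $\gamma$
are non-singular and $\beta \neq 0$, since the vectors $Y_1, \ldots, Y_{\kappa_1}$ and $Y_{\kappa_1+1}, \ldots, Y_{\kappa_1+\kappa_2}$ are non-orthogonal.

Inverting the equation (5-6) we get

$$\begin{pmatrix} U_{I} \\ U_{II} \end{pmatrix} = \begin{pmatrix} \alpha^{-1} & 0 \\ -\gamma^{-1}\beta\alpha^{-1} & \gamma^{-1} \end{pmatrix} \begin{pmatrix} \bar{Y}_{I} \\ \bar{Y}_{II} \end{pmatrix}. \tag{5-7}$$

Replacing the vectors $\bar{Y}_i$ by $\bar{\pi}_i$ and $U_i$ by $\Phi_i$, the corresponding relation must hold between
the $\Phi$’s and the $\pi$’s. Hence $\Phi_1, \ldots, \Phi_{\kappa_1}$ depend upon $\pi_1, \ldots, \pi_{\kappa_1}$ alone but $\Phi_{\kappa_1+1}, \ldots, \Phi_{\kappa_1+\kappa_2}$
cannot be expressed solely in terms of $\pi_{\kappa_1+1}, \ldots, \pi_{\kappa_1+\kappa_2}$, since $(\gamma^{-1}\beta\alpha^{-1}) \neq 0$. Thus the sets of
linear functions $U_1, \ldots, U_{\kappa_1}$ and $U_{\kappa_1+1}, \ldots, U_{\kappa_1+\kappa_2}$ do not provide quasi-independent tests of
$H_1$ and $H_2$. We shall see later in II (c) that this can be done when the transformation is
singular, i.e. when we choose a suitable subset of $\kappa_2 - \kappa_1$ vectors from the vectors $U_{\kappa_1+1}, \ldots, U_{\kappa_1+\kappa_2}$.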

We shall also consider two other methods of setting inequalities to reduce the ALTE to 
orthogonal sets. The essence of these methods is to cover a complicated region C by a sphere 


S and using the probability of the sphere for the region C as an upper bound. 


Method (a). Method of simultaneous confidence intervals for all contrasts 
This method is due to Scheffe (1953), Tukey (1952) and Roy & Bose (1953). It can be used 
for the simultaneous test but it also gives the confidence intervals for all linear functions of 
the parameters 7, ..., 7,3 Mq,415 «++ <,+x, Simultaneously with a given confidence coefficient. 
In equation (5-6) since both U’s and Y’s are normalized vectors 


Ky ) 
Doni =1l (n= 1.,...,&), 
den 


z (5-8) 
THA.+ v= 1 (n= 1, ...,Kq)- 
i=1 i=1 
Also (Puy) = Daea(Uy) (n = 1,2,...,%), 
ass (5-9) 


| aoe y) = Dh a(O, y) + Bnd Oem y) (n 1, 2, wee Kg), 










and similar equations hold between 7’s and ®’s. From Schwartz’s inequality 
. Ki tk, 
[Fn~) APS EZ [Opy)- OP (m = 1,2, -005 K+ Kp). 
j= 


Ky +k, 
Hence Pr {| (Y,; y)—7, | <d (n= Ls vos +K)}> Pr x ((G;,y)— 9}? < 6*). (5-10) 
j=1 
Kitks 
For the set M, we now use x7 = > [(U;,y)—©,]*in (4-7). This method has the advantage 
j=1 


that if 7 is a linear function of 7,’s, then at the same time, with the same confidence 
coefficient, we have the confidence interval for 7, 


ofa Oy 7, Feet Ber tee Wey tng? 
} (5-11) 
oY, + ts 5 Xs tke at = 6 V(Za%). 
Method (b) 
From the equation (5-6) we have Y,, = BU, + Wy. (5-12) 


Since the matrix y is non-singular 
YY = y BU, + Oy. (5-13) 


Let y Yj, = 2;, which is a column vector with components Sesto +++ Sey4xq2 aNd let Z, 43, +++; 
Z,,+«, be the corresponding normalized vectors, then 


Fey) = Bip y) + + Bie YW) + V1 Gapsa> YW), bah 
(Zc,+499Y) = Bi, (Uy) )+... +B, Ua) + Vig rue y)s 
so that SP +yi? =1 for n= k,+1,...,«,+,, and similar equations hold between the 
sitpectell vebhien y; = E(z,;,y) and ®;, = E(U;,y). Hence from Schwartz’s inequality 
[EnV Val < SU py) — OP + (Cn) — OP (m= Ky 1, ky +Ke), 
we (5:15) 
“S(t Val <he E (Opry) Oi + "S (Cyy)— OF. 


Since {z,,} are linear functions of Y,..,,...,%osu: Yn = Bn y) (nm = «,+1,...) are linear 


functions of parameters 77,, 1, ---;7x,+x,» 2 confidence region for these parameters is given by 
> Ki +k e 
ts Gn y) —VP< < Ke = {( (U; »Y) — 0, ae x UG, ~n? y)- ®,,) S C2; (5 16) 
n=Ky+ n=1 n=K, 


while a confidence region for the parameters 7, ...,7,, is given by 


¥((U,.y)- 0,2 <O,. (5-17) 
n=1 ' 


When there are more than two hypotheses in the set M,, e.g. if there are three hypotheses 
H,, H,, H; with the groups of parameters 7), ... Tayi 7 7 Westin dy ++29 Wedachtte 
with best linear estimates Y,,...,¥..; Yes ---¥, 


Kyth? °°°? Ky +Kq? 


‘al catia) Leytegtt> +++» Leytnetegs WE use the same 













method as before for the hypothesis H, and by combining the hypotheses H, and H,. We then 
have the following confidence regions for the three groups of parameters: 


(7, “S¥e T,,): LU, y) nT ®;}? S C, 


Ky tks Ky KitKs 
(W415 ooey Westecy)? p> [Gn - PnP < 2% (Ui,y) — OP + , x IG, y)—- 9, <G,, r 
n=kK\+ i= i=K 


Ky tKaetKs 


Ky tks Ky tke tks 
Teythe t+)? 009 Mes pugtns)® = [(2,; Y)— Val? < Ks z ((U;, y)—-O P+ p> ((U;, y) — 9, <C,. 


N=Kyt+Kytl i=K, +k, +1 





(5-18) 
For the test of the hypotheses 
Ay: m= 0, ..., 7, = 9 
H,: T41 = 90, oe Teytxy = 9 
Ay: Tey tKe+1 = 0 “99 Tey+ke+Kks =0 
we consider = Dry Fy = x9/(«, 82), 
i=1 
= ES (Cw = rile). | (5-19) 
i=«,71 
= SD (Uy Fy=x9l(eyS), 
i=Ky +kgt+l1 





Since ®; = 0 on the given hypotheses, for all i = 1,2,..., K, +Kg+ 3. If the a % point of 
F,, F,. F; (corresponding to G,, G, @,) are €,, €g, €3, then we consider the significance limits 


l 1 1 : 
of = x’, = x, “a x3 for the test of hypotheses H,, H,, H, respectively as n, S%e,, n,S2(k,€, + €) 


and n,S2(K,€, +K,€,+€3). The upper bound of the significance level is then a. 


Method (c) 


Let U,,..., Us Uesrs-++> Use, be an orthonormal basis of the space formed by the 
vectors Yj, ...,¥.,3 Yijsas-++sYivexgr 80 that U,,...,U,, is a basis of the space of the vectors 
Y,,....¥,,. We shall show that when x, <x , it is possible to find orthogonal sets of x, and 
Ky — x, linear functions for the hypothesis H, and H,,so that the inequality (5-3) could be used. 


In (5-6), since the rank of # is at most x,, by a transformation of the vectors ¥,, ..., Y, 


x,’ 


P,. .1s «+++ Fe, saqs the matrix # can be reduced to the triangular form with x, — x, rows of zeros, 
ie. the transformed vectors Yj, ..., ¥/.; ¥i..,,...,¥..,, would be given by 
J 1 1 2 = 
Fett = AiG t+ Bieler... + Bie Uy t+ ViwUger tet Vier Uerey — ) 
, ” y y P ” y ’ ” T | 
) re = B3,2Ua + -»- + BS, , Oe, + ¥2, 10 41 ete ee ee | 
iemaammmmaiae™ ye + eae Sorc epee =o me (5-20) 
Yoctt = Ye 42,1 ga F +++ + Yey+t.0q Tey? 
} Kytk, Veep, Oe, +1 +. + , a! 
Let W,,...,W,_,, form an orthonormal basis of the vector space of | re Y ae) We 
then have Ks 
E{(W,,y)} = Deus Mees = Ose (b= 1, -.-5K2—))- (5-21) 
d= 













The hypothesis Ay: 4, = 0, «2, @ =0 


Ky +Ke ’ 
implies Ay: O,4,=9, ..-, O45. = 9, 
since ®, ,1,.--, D,.4,, are linear functions of 7,, 1, ..-,7,+«, and have the rank x,.—K«,. Now 


H,;, (i.e. rejection of H}) implies H, (i.e. rejection of H,). We therefore test for the Rie 


H; and reject H, only when H; is rejected. A confidence region for the parameters 7, ,,,.... 


T.,+«, 18 obtained from. = [(W;, y) — ®,,.,]* which is distributed independently of 
UG; y) a o,}*, 
and these sums of squares may be used in equations (5-3) and (4-7). 


6. RELATIVE MERITS OF METHODS (a), (b) AND (c) 


The method (b) obviously gives a much closer inequality than (a), although the latter has
the additional advantage that it gives confidence intervals for all contrasts of the para- 
meters in the hypotheses H, and H,, in the set M,. However, in most practical cases, the 
parameters entering into the hypotheses H, and H, relate to quite distinct characters, and 
thus there would not, usually, be any need to consider contrasts involving both sets of 
parameters. 

In method (c) we use orthogonal sets of linear functions and form
$\chi^2$'s with $\kappa_1$ and $\kappa_2-\kappa_1$ d.f. from these instead of using $\chi^2$'s with $\kappa_1$ and $\kappa_2$ d.f. from the linear
functions $Y_1, \dots, Y_{\kappa_1}$; $Y_{\kappa_1+1}, \dots, Y_{\kappa_1+\kappa_2}$, which would have given quasi-independent tests for
these hypotheses. The loss of degrees of freedom is serious when $\kappa_2-\kappa_1$ is small, as the $F$-table
shows. Thus the method (c) gives a good result only when $\kappa_2-\kappa_1$ is not too small.

On the other hand, the method (b) gives a wide inequality unless $\kappa_1$ is small. It is not
possible, however, to make a simple quantitative statement about the relative advantages
of the methods (b) and (c) without detailed numerical calculations. One could, however,
follow a tentative rule for the use of these methods:

$\kappa_1$ small,      $\kappa_2-\kappa_1$ small:      use (b),
$\kappa_1$ small,      $\kappa_2-\kappa_1$ not small:  use (b) or (c),
$\kappa_1$ not small,  $\kappa_2-\kappa_1$ small:      use (b) or (c),
$\kappa_1$ not small,  $\kappa_2-\kappa_1$ not small:  use (c).


The idea of this paper arose in the analysis of the data referred to in the text and in 
discussions with Dr Hughes Bryan and Dr B. G. Greenberg. The numerical details of the 
calculations will be published in a subsequent paper. The author wants to put on record 
his appreciation of helpful discussions with Dr Hughes Bryan and Dr B. G. Greenberg, 
of the University of North Carolina, and of help from the editorial board of Biometrika in 
improving the presentation of this paper. 


REFERENCES 


KIMBALL, A. W. (1951). On dependent tests of significance in the analysis of variance. Ann. Math.
Statist. 22, 600.

NANDI, H. K. (1951). On the analysis of variance test. Bull. Calcutta Statist. Ass. 3, 103.

ROY, S. N. & BOSE, R. C. (1953). Simultaneous confidence interval estimation. Ann. Math. Statist.
24, 513.

SCHEFFÉ, H. (1953). A method for judging all contrasts in the analysis of variance. Biometrika, 40, 87.

TUKEY, J. (1952). Allowances for various types of error rates. Unpublished invited address, Blacks-
burg meeting of the Institute of Mathematical Statistics, March 1952.













RANK ANALYSIS OF INCOMPLETE BLOCK DESIGNS 


III. SOME LARGE-SAMPLE RESULTS ON ESTIMATION AND POWER 
FOR A METHOD OF PAIRED COMPARISONS* 


By RALPH ALLAN BRADLEY 


Virginia Agricultural Experiment Station of the Virginia Polytechnic Institute 


I. INTRODUCTION 
1-1. Rank analysis of incomplete block designs 


A new method for paired comparisons was discussed in two recent papers (Bradley & Terry, 
1952; Terry, Bradley & Davis, 1952). A mathematical model was postulated and tests of 
significance were developed. The procedures were considered as special cases of a rank 
analysis of incomplete block designs, and many of the concepts may be extended from the 
limited considerations of paired comparisons to ranking in incomplete block designs with 
two or more treatments in an incomplete block. The appropriateness of the model for paired 
comparisons has been discussed by Hopkins (1954) and also by Bradley (1954a). A large 
section of tables for a test of significance on the equality of treatment effects was included 
in the first reference cited and additional tables were prepared by the present author (1954b).
For all of the tests considered in the above-referenced work, large-sample approximations to 
the sampling distributions of the statistics, under the conditions of the null hypotheses, 
have been given. 
1-2. Review of the method of paired comparisons 

It is necessary to summarize the mathematical model of the method of paired com- 
parisons in order to outline the objectives of this paper. 

We postulated true treatment ratings or parameters, $\pi_1, \dots, \pi_t$ for $t$ treatments in an
experiment involving paired comparisons. It was assumed that every $\pi_i > 0$, and, for
convenience, that† $\sum_i \pi_i = 1$. Further definition followed with the assumption that, when
treatment $i$ appears with treatment $j$ in a block, the probability that treatment $i$ obtains
the higher rating (or a rank of 1) is $\pi_i/(\pi_i+\pi_j)$. Assuming independence in probability of
treatment comparisons, we wrote the likelihood function in its general form as

$$L = \prod_i \pi_i^{a_i} \prod_{i<j} (\pi_i+\pi_j)^{-n},$$ (1)

where
$$a_i = 2n(t-1) - \sum_j{}' \sum_{k=1}^{n} r_{ijk}.$$ (2)

$r_{ijk}$ is the rank of treatment $i$ in the comparison with treatment $j$ in the $k$th of $n$ repetitions
of the paired comparisons design.
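
As an illustration of (1) and (2), the sketch below turns a table of ranks into the counts $a_i$ and evaluates $\ln L$. The data layout (`ranks[k][(i, j)]` holding the rank, 1 or 2, of treatment `i` against `j` in repetition `k`) and the function names are assumptions made for this example only.

```python
import math


def wins_from_ranks(ranks, t, n):
    """a_i of equation (2): 2n(t-1) minus the sum of ranks of treatment i.

    Since each rank is 1 (preferred) or 2, a_i is the number of
    comparisons in which treatment i was ranked first.
    """
    a = []
    for i in range(t):
        rank_sum = sum(ranks[k][(i, j)]
                       for k in range(n)
                       for j in range(t) if j != i)
        a.append(2 * n * (t - 1) - rank_sum)
    return a


def log_likelihood(pi, a, n):
    """Natural logarithm of L in equation (1)."""
    t = len(pi)
    value = sum(a[i] * math.log(pi[i]) for i in range(t))
    value -= n * sum(math.log(pi[i] + pi[j])
                     for i in range(t) for j in range(i + 1, t))
    return value


if __name__ == "__main__":
    # One repetition, three treatments: 0 beats 1, 0 beats 2, 1 beats 2.
    ranks = [{(0, 1): 1, (1, 0): 2, (0, 2): 1, (2, 0): 2, (1, 2): 1, (2, 1): 2}]
    a = wins_from_ranks(ranks, t=3, n=1)
    print(a, log_likelihood([1 / 3] * 3, a, n=1))
```

With equal parameters the likelihood reduces to $2^{-nt(t-1)/2}$ whatever the data, which gives a quick check on the implementation.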


* This project was supported by funds from the Research and Marketing Act of 1946, under Con- 
tract with the Agricultural Research Service, United States Department of Agriculture. 
† $\sum_i$ and $\prod_i$ will indicate respectively sums and products with $i = 1, \dots, t$. $\sum_i{}'$ will mean that one
value of $i$ that appears in the argument of the summation is omitted. $\prod_{i<j}$ or $\prod_{i \ne j}$ represent products,
$i = 1, \dots, t$, $j = 1, \dots, t$, $i<j$ or $i \ne j$ respectively. Departures from these conventions will be specified.



















The method of maximum likelihood was used to obtain estimators, $p_1, \dots, p_t$, of the
parameters, $\pi_1, \dots, \pi_t$ and likelihood ratio tests have been used throughout discussions
based on the model set down. A general class of tests of the null hypothesis,

$$H_0\colon\ \pi_i = 1/t \quad (i = 1, \dots, t),$$
against the alternative hypothesis,
$$H_a\colon\ \pi_i = \pi(h) \quad (h = 1, \dots, m;\ i = s_{h-1}+1, \dots, s_h),$$

where $s_0 = 0$, $s_m = t$ and $\sum_{h=1}^{m} (s_h - s_{h-1})\,\pi(h) = 1$, was formulated. These are, of course, tests
of the indistinguishability of treatment effects on some attribute, perhaps colour or
flavour, of the treatments. Two special cases were considered by Bradley & Terry (1952)
and the specialization comes under $H_a$.

Case (i): The hypothesis $H_a$ becomes

$H_1$: No $\pi_i$ is assumed equal to any $\pi_j$ $(i \ne j)$;
that is, in $H_a$, $m = t$.
The normal equations resulting from the use of the method of maximum likelihood
reduced to
$$\frac{a_i}{p_i} - n\sum_j{}' (p_i+p_j)^{-1} = 0 \quad (i = 1, \dots, t)$$ (3)

and $$\sum_i p_i = 1.$$ (4)
In this paper we shall be particularly interested in the statistic* $T = -2\ln\lambda_1$, where $\lambda_1$ is
the likelihood ratio for this special test. In this case
$$T = nt(t-1)\ln 2 - 2B_1\ln 10,$$ (5)
with† $$B_1 = n\sum_{i<j} \log(p_i+p_j) - \sum_i a_i\log p_i.$$ (6)
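
The equations (3) to (6) can be explored numerically. The sketch below solves (3) and (4) by a simple fixed-point iteration (rearranging (3) as $p_i = a_i / [n\sum_j' (p_i+p_j)^{-1}]$ and renormalizing); that iteration is an assumption of this example and is not the computing scheme described in the text. It requires every $a_i > 0$. Function names are illustrative.

```python
import math


def fit_estimators(a, n, iterations=2000):
    """Approximate solution p_1..p_t of the normal equations (3) and (4)."""
    t = len(a)
    p = [1.0 / t] * t
    for _ in range(iterations):
        updated = [a[i] / (n * sum(1.0 / (p[i] + p[j])
                                   for j in range(t) if j != i))
                   for i in range(t)]
        total = sum(updated)
        p = [value / total for value in updated]  # enforce (4)
    return p


def statistic_T(p, a, n):
    """T of equation (5), with B_1 of equation (6) in common logarithms."""
    t = len(p)
    b1 = n * sum(math.log10(p[i] + p[j])
                 for i in range(t) for j in range(i + 1, t))
    b1 -= sum(a[i] * math.log10(p[i]) for i in range(t))
    return n * t * (t - 1) * math.log(2) - 2.0 * b1 * math.log(10)


if __name__ == "__main__":
    a, n = [3, 2, 1], 2          # t = 3 treatments, n = 2 repetitions
    p = fit_estimators(a, n)
    print([round(v, 4) for v in p], round(statistic_T(p, a, n), 4))
```

When all $a_i$ are equal the iteration stays at $p_i = 1/t$ and $T = 0$, as expected under $H_0$.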
Case (ii): $H_a$ becomes
$$H_2\colon\ \pi_i = \pi \quad (i = 1, \dots, s);\qquad \pi_i = (1-s\pi)/(t-s) \quad (i = s+1, \dots, t).$$

The general hypothesis $H_a$ is thus restricted to the case in which $m = 2$. The estimator $p$
of $\pi$ was given as

$$p = \frac{ns(4t-s-3) - 2\sum_{i=1}^{s}\sum_j{}'\sum_{k=1}^{n} r_{ijk}}{ns(5st-2t^2-6s+3t) - 2(2s-t)\sum_{i=1}^{s}\sum_j{}'\sum_{k=1}^{n} r_{ijk}}.$$ (7)

A test statistic and an approximate test procedure were set forth in the reference. It
will not be necessary to review these methods in view of the remarks that follow in § II.
Abelson & Bradley (1954) considered factorial arrangements of treatments imposed on
the paired comparisons design. We shall not consider that situation in this paper.


1-3. Objectives 
The objectives of this paper evolve from the need of considering the behaviour of the 
developed tests of significance and estimates of population parameters when the assump- 
tions of the null hypotheses may not be true. We shall be interested in the power of the test 


* log and In indicate common and natural logarithms respectively. 
† $B_1$ is the statistic tabulated for small samples. See Bradley & Terry (1952) and Bradley (1954b).













based on $T$ in (5), in the reliability of the estimators $p_i$ of $\pi_i$ defined by (3) and (4), and in
comparisons of the power of this test procedure with those of other possible procedures.

In the initial reference, it has been shown that $T$, in (5), under $H_0$, has the distribution of
$\chi^2$ with $(t-1)$ degrees of freedom for large samples. The difficulty to be expected in attempting
an exact evaluation of the power of the test based on $T$ for small samples, even for very
restricted sets of alternative values of the parameters, was also discussed. Accordingly, we
limit our objectives here to a large-sample evaluation and comparison of power functions
and to the estimation of variances and covariances using large-sample results.

We shall show that Case (ii) yields the 'sign test', the properties of which are well known,
and we may then limit our attention to Case (i).


II. CASE (ii) AND THE SIGN TEST
2-1. Case (ii)

We refer to $H_2$: $\pi_i = \pi$ $(i = 1, \dots, s)$; $\pi_i = (1-s\pi)/(t-s)$ $(i = s+1, \dots, t)$ and the estimator $p$
of $\pi$ given in (7). It was not originally noted that the test procedure here reverts to a
binomial test.

Comparisons of treatments in the first group of $s$ treatments yield contributions to the
sums of ranks in (7) of 3. Then, if $X$ is defined to be the number of times a treatment of the
first group ranks above (obtains rank 1) a treatment of the second group,

$$\sum_{i=1}^{s}\sum_j{}'\sum_{k=1}^{n} r_{ijk} = \tfrac{3}{2}ns(s-1) + 2ns(t-s) - X.$$ (8)

Substitution of (8) in (7) yields

$$p = X/[(2s-t)X + ns(t-s)^2].$$ (9)
Now, from the model for paired comparisons, the probability that a treatment $i$ of the first
group ranks above a treatment $j$ of the second group in any of the $n$ repetitions is

$$P(r_{ijk} = 1) = \frac{\pi}{\pi + \dfrac{1-s\pi}{t-s}} = \frac{(t-s)\pi}{1+(t-2s)\pi}$$ (10)

$(i = 1, \dots, s;\ j = s+1, \dots, t;\ k = 1, \dots, n)$.

When we substitute the estimator $p$ of $\pi$ given in (9) in the form (10), we obtain the estimated
probability,
$$\text{Est } P(r_{ijk} = 1) = X/[ns(t-s)].$$ (11)
There are $ns(t-s)$ comparisons of treatments of the first group with treatments of the
second group, and it is now apparent that this special case reduces to a consideration of
$ns(t-s)$ binomial trials. The test procedures reduce to those of the binomial or sign test.
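
The chain (7), (8), (9), (10), (11) can be verified with exact rational arithmetic. The following sketch uses hypothetical function names and an arbitrary small design ($n = 4$, $s = 2$, $t = 5$, $X = 9$) chosen only for illustration.

```python
from fractions import Fraction


def rank_sum_first_group(X, n, s, t):
    """Right-hand side of (8): total rank sum for the first s treatments."""
    return Fraction(3 * n * s * (s - 1), 2) + 2 * n * s * (t - s) - X


def estimator_from_rank_sum(R, n, s, t):
    """Estimator p of equation (7), written in terms of the rank sum R."""
    numerator = n * s * (4 * t - s - 3) - 2 * R
    denominator = n * s * (5 * s * t - 2 * t * t - 6 * s + 3 * t) - 2 * (2 * s - t) * R
    return Fraction(numerator) / Fraction(denominator)


def estimator_from_X(X, n, s, t):
    """Estimator p of equation (9)."""
    return Fraction(X, (2 * s - t) * X + n * s * (t - s) ** 2)


def win_probability(pi, s, t):
    """Equation (10): chance that a first-group treatment ranks first."""
    return (t - s) * pi / (1 + (t - 2 * s) * pi)


if __name__ == "__main__":
    n, s, t, X = 4, 2, 5, 9
    R = rank_sum_first_group(X, n, s, t)
    p = estimator_from_rank_sum(R, n, s, t)
    print(p, estimator_from_X(X, n, s, t), win_probability(p, s, t))
```

For these values both forms of the estimator give $p = 1/7$, and (10) then returns $9/24 = X/[ns(t-s)]$, the sample proportion of (11).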


2-2. The sign test 


The properties of the sign test have been thoroughly investigated. Hemelrijk (1952) notes 
that it is probably the oldest test in existence and refers to an application by Arbuthnot in 
1710. The variance and approximate normality of X are discussed in elementary texts,
and exact power function evaluations can be obtained for small samples from tables of the 
binomial distribution or of the incomplete beta function. The power of the sign test for large 
samples is based on the approximate normality of the statistic. 













Dixon & Mood (1946) showed that the relative efficiency of the sign test in comparison
with the $t$ test under assumptions suitable for the valid application of the latter test is $2/\pi$.
Later, Dixon (1953) prepared small-sample tables of power efficiencies that indicated that
the sign test compares more favourably with the $t$-test than indicated by the asymptotic
value, $2/\pi$.

In considering applications of Case (ii), it is preferable to go directly to the use of the sign
test based on the $ns(t-s)$ comparisons of treatments in the first group with those of the
second group. It will usually be sufficient to estimate $P(r_{ijk} = 1)$ and not necessary to con-
sider estimating $\pi$ itself. If required, the variance of $p$, the estimator of $\pi$, may be obtained
from the variance of $X$ in an approximation through the use of usual formulae for the
variance of a ratio.

We shall devote the remainder of this paper to a consideration of Case (i). 


III. Estimation 
3-1. Asymptotic distribution of $(t-1)$ maximum-likelihood estimators
We shall require the large-sample distribution of the maximum-likelihood estimators,
$p_1, \dots, p_t$, of Case (i) and their asymptotic variances and covariances. These results will be
of interest in themselves and useful in the development of subsequent sections of this paper.
Some new notation will assist in this discussion. Let
$$f(x, \pi) = \prod_i \pi_i^{x_i} \prod_{i<j} (\pi_i+\pi_j)^{-1},$$ (12)

where we use $x$ and $\pi$ to represent vectors, $(x_1, \dots, x_t)$ and $(\pi_1, \dots, \pi_t)$. Now $x_i$ is the number of
times treatment $i$ obtains a ranking of unity in a repetition of paired comparisons. If $x_{ik}$
is the observation on $x_i$ in the $k$th of $n$ repetitions, the association with $a_i$ of (1) is

$$\sum_{k=1}^{n} x_{ik} = a_i \quad (i = 1, \dots, t).$$ (13)

The likelihood function $L$ in (1) is now $\prod_{k=1}^{n} f(x_k, \pi)$. It is also convenient to define

$$\lambda_{ii} = \frac{1}{\pi_i}\sum_j{}' \pi_j(\pi_i+\pi_j)^{-2} \quad (i = 1, \dots, t),$$
$$\lambda_{ij} = -(\pi_i+\pi_j)^{-2} \quad (i \ne j;\ i, j = 1, \dots, t).$$ (14)

We shall require the means, variances and covariances of $x_1, \dots, x_t$. Let $x_{ij}$ be an indicator
variate with the value unity if treatment $i$ ranks above treatment $j$ and zero otherwise.
Then $x_i = \sum_j{}' x_{ij}$. $x_{ij}$ is a binomial variate with expectation $\pi_i(\pi_i+\pi_j)^{-1}$, and variance
$\pi_i\pi_j(\pi_i+\pi_j)^{-2}$. The variates $x_{ij}$ making up the sum $x_i$ are independent in probability and
it follows that
$$E(x_i) = \pi_i\sum_h{}' (\pi_i+\pi_h)^{-1} \quad (i = 1, \dots, t),$$ (15)
$$V(x_i) = \pi_i\sum_j{}' \pi_j(\pi_i+\pi_j)^{-2} = \pi_i^2\lambda_{ii} \quad (i = 1, \dots, t),$$ (16)

and $$\operatorname{cov}(x_i, x_j) = -\pi_i\pi_j(\pi_i+\pi_j)^{-2} = \pi_i\pi_j\lambda_{ij} \quad (i \ne j;\ i, j = 1, \dots, t).$$ (17)
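
The moments (15) to (17) can be checked against a brute-force enumeration of every possible outcome of one repetition. This is a sketch for small $t$ only; the function names and the parameter values used are illustrative assumptions.

```python
from itertools import product


def lambda_matrix(pi):
    """The quantities lambda_ij of definition (14)."""
    t = len(pi)
    lam = [[0.0] * t for _ in range(t)]
    for i in range(t):
        for j in range(t):
            if i != j:
                lam[i][j] = -1.0 / (pi[i] + pi[j]) ** 2
        lam[i][i] = sum(pi[j] / (pi[i] + pi[j]) ** 2
                        for j in range(t) if j != i) / pi[i]
    return lam


def exact_moments(pi):
    """Means and covariances of x_1..x_t by enumerating one repetition."""
    t = len(pi)
    pairs = [(i, j) for i in range(t) for j in range(i + 1, t)]
    mean = [0.0] * t
    second = [[0.0] * t for _ in range(t)]
    for outcome in product((True, False), repeat=len(pairs)):
        prob, x = 1.0, [0] * t
        for (i, j), i_wins in zip(pairs, outcome):
            q = pi[i] / (pi[i] + pi[j])      # chance that i ranks above j
            prob *= q if i_wins else 1.0 - q
            x[i if i_wins else j] += 1
        for i in range(t):
            mean[i] += prob * x[i]
            for j in range(t):
                second[i][j] += prob * x[i] * x[j]
    cov = [[second[i][j] - mean[i] * mean[j] for j in range(t)]
           for i in range(t)]
    return mean, cov


if __name__ == "__main__":
    pi = [0.5, 0.3, 0.2]
    lam = lambda_matrix(pi)
    mean, cov = exact_moments(pi)
    print(cov[0][0], pi[0] ** 2 * lam[0][0])       # two sides of (16)
    print(cov[0][1], pi[0] * pi[1] * lam[0][1])    # two sides of (17)
```

The enumeration grows as $2^{t(t-1)/2}$, so it is meant as a check on the formulae rather than a computing method.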












The parameters $\pi_1, \dots, \pi_t$ are not independent but subject to the restriction $\sum_i \pi_i = 1$.
Accordingly we may regard $p_1, \dots, p_{t-1}$ as maximum-likelihood estimators of the independent
parameters, $\pi_1, \dots, \pi_{t-1}$, taking $\pi_t = 1-\sum_{i=1}^{t-1}\pi_i$. Then $\sqrt{n}\,(p_1-\pi_1), \dots, \sqrt{n}\,(p_{t-1}-\pi_{t-1})$ have
a joint limiting normal distribution with zero means subject to the verification of certain
regularity conditions. The required regularity conditions are quite well known and given,
for example, by Cramér (1946, § 33-3, p. 500) and Chanda (1954), who fully states the con-
ditions but for the continuous case. We have verified the conditions for the likelihood
function $L$ of (1) expressed in terms of $f(x, \pi)$ of (12), but these demonstrations are omitted
for brevity.

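The dispersion matrix of the joint limiting normal distribution of our $(t-1)$ estimators
is the matrix $[\lambda'_{ij}]^{-1}$ with the definition of $\lambda'_{ij}$ below. We need only note that

$$\frac{\partial \ln f}{\partial \pi_i} = \frac{x_i}{\pi_i} - \frac{x_t}{\pi_t} - \sum_h{}' (\pi_i+\pi_h)^{-1} + \sum_h{}' (\pi_t+\pi_h)^{-1} \quad (i = 1, \dots, (t-1)),$$ (18)

and, from (15), that

$$E\left(\frac{\partial \ln f}{\partial \pi_i}\,\frac{\partial \ln f}{\partial \pi_j}\right) = \operatorname{cov}\left[\left(\frac{x_i}{\pi_i}-\frac{x_t}{\pi_t}\right), \left(\frac{x_j}{\pi_j}-\frac{x_t}{\pi_t}\right)\right] = \lambda'_{ij} \quad (i, j = 1, \dots, (t-1)),$$

thereby defining $\lambda'_{ij}$. It follows from (16) and (17) that

$$\lambda'_{ij} = \lambda_{ij} - \lambda_{it} - \lambda_{jt} + \lambda_{tt}.$$ (19)

The matrix $[\lambda'_{ij}]$ is non-negative definite since it is a dispersion matrix and it is positive
definite since $x_1, \dots, x_{t-1}$, and hence $\partial \ln f/\partial \pi_1, \dots, \partial \ln f/\partial \pi_{t-1}$, are free of linear restrictions.

The conclusion, with the established validity of the necessary conditions, is that

$\sqrt{n}\,(p_1-\pi_1), \dots, \sqrt{n}\,(p_{t-1}-\pi_{t-1})$ have for large samples the multivariate normal distribution
with zero means and dispersion matrix $[\lambda'_{ij}]^{-1}$.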


3-2. Variances and covariances of estimators 


In the preceding section, the parameter $\pi_t$ was considered to be a function of the re-
maining $(t-1)$ parameters. That process introduced asymmetry into the elements of the
dispersion matrix, $[\lambda'_{ij}]^{-1}$. This lack of symmetry is essentially artificial and will now be
removed.

The $t \times t$ matrix, $[\lambda_{ij}]$, is singular in view of the definitions (14) and the fact that

$$\pi_i\lambda_{ii} + \sum_j{}' \pi_j\lambda_{ij} = 0.$$ (20)

Then, if the elements of the last row and then of the last column of the matrix $[\lambda_{ij}]$ are
subtracted from corresponding elements of the remaining rows and columns respectively,
it follows that

$$|\lambda'_{ij}| = \begin{vmatrix} [\lambda'_{ij}] & [\lambda_{it}-\lambda_{tt}]' \\ [\lambda_{tj}-\lambda_{tt}] & 1+\lambda_{tt} \end{vmatrix},$$ (21)

where $[\lambda_{it}-\lambda_{tt}]'$ and $[\lambda_{tj}-\lambda_{tt}]$ are respectively column and row vectors of $(t-1)$ elements.








Now, by reversing the process and adding the elements of the last row and then the last
column to corresponding elements of the remaining rows and columns in (21), we have

$$|\lambda'_{ij}| = |\lambda_{ij}+1| = -\begin{vmatrix} [\lambda_{ij}] & [1]' \\ [1] & 0 \end{vmatrix},$$ (22)

where $[1]$ and $[1]'$ are respectively row and column vectors of $t$ unit elements. The last step
again depends on the result that $|\lambda_{ij}| = 0$. In the same way it can be shown that the co-
factor of $\lambda_{ij}$ in the extreme right-hand side of (22) is equal to the cofactor of $\lambda'_{ij}$ in $|\lambda'_{ij}|$.
Then, if $\sigma_{ij}$ is the covariance of $\sqrt{n}\,(p_i-\pi_i)$ and $\sqrt{n}\,(p_j-\pi_j)$ $(i, j = 1, \dots, (t-1))$,

$$\sigma_{ij} = \frac{\text{cofactor of } \lambda_{ij} \text{ in } \begin{vmatrix} [\lambda_{ij}] & [1]' \\ [1] & 0 \end{vmatrix}}{\begin{vmatrix} [\lambda_{ij}] & [1]' \\ [1] & 0 \end{vmatrix}}.$$ (23)

But the formulation of the model for paired comparisons is symmetric in the parameters
$\pi_1, \dots, \pi_t$ and the estimators $p_1, \dots, p_t$, and (23) applies for all variances and covariances with
$i, j = 1, \dots, t$ on the basis of its symmetry.

The variances and covariances obtainable from (23) are simply the elements of the
$t$-square principal minor of the inverse of the matrix of the determinant in the denominator
of (23). For small values of $t$, this inverse matrix may be evaluated by elementary methods
when $\pi_1, \dots, \pi_t$ are specified or replaced by estimates. However, since $|\lambda_{ij}| = 0$, the usual
Doolittle methods of matrix inversion break down. Then for larger values of $t$, it appears
desirable to invert $[\lambda'_{ij}]$ specified through (19) using a Doolittle method, thus obtaining the
required variances and covariances with $i, j = 1, \dots, (t-1)$. The remaining variance and
covariances associated with $\sqrt{n}\,(p_t-\pi_t)$ are determined through the relationship

$$\sqrt{n}\,(p_t-\pi_t) = -\sum_{i=1}^{t-1} \sqrt{n}\,(p_i-\pi_i),$$

so that $$\sigma_{tt} = \sum_{i=1}^{t-1}\sum_{j=1}^{t-1} \sigma_{ij} \quad\text{and}\quad \sigma_{it} = -\sum_{j=1}^{t-1} \sigma_{ij} \quad (i = 1, \dots, (t-1)).$$

Since it has been established in § 3-1 that $\sqrt{n}\,(p_1-\pi_1), \dots, \sqrt{n}\,(p_{t-1}-\pi_{t-1})$ have asymptotic-
ally a multivariate normal distribution, and since $\sqrt{n}\,(p_t-\pi_t)$ is a linear function of those
variates, we may now state that

$\sqrt{n}\,(p_1-\pi_1), \dots, \sqrt{n}\,(p_t-\pi_t)$ have, for large values of $n$, the singular multivariate normal
distribution of $(t-1)$ dimensions in a space of $t$ dimensions with zero means and dispersion
matrix, $[\sigma_{ij}]$, defined through (23).

In general, for large samples, we may take $p_1, \dots, p_t$ to be jointly normally distributed
with means $\pi_1, \dots, \pi_t$, and dispersion matrix $[\sigma_{ij}]/n$. Then any linear function $\sum_i \delta_i p_i$ may
be taken to be normal for large samples with mean $\sum_i \delta_i\pi_i$ and variance $\sum_i\sum_j \delta_i\delta_j\sigma_{ij}/n$. In
particular, we may be interested in orthogonal linear comparisons of the sort often con-
sidered in the analysis of variance.

Estimated variances and covariances will usually be required. Through the consistency
of maximum-likelihood estimators, we can define $\hat\lambda_{ij}$ $(i, j = 1, \dots, t)$ to be the same functions
of $p_1, \dots, p_t$ as $\lambda_{ij}$ are of $\pi_1, \dots, \pi_t$ as defined in (14). Then $[\hat\sigma_{ij}]$, the sample dispersion matrix,
is the same function of $\hat\lambda_{ij}$ as $[\sigma_{ij}]$ is of $\lambda_{ij}$ in definition (23).
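
A numerical sketch of the two routes to the dispersion matrix may help: inverting the bordered matrix whose determinant appears in the denominator of (23), and inverting the reduced $(t-1) \times (t-1)$ matrix built by subtracting the last row and column as described in § 3-2. The Gauss-Jordan routine below is a plain substitute for the Doolittle method mentioned in the text, and all function names are assumptions of this example.

```python
def lambda_matrix(pi):
    """The quantities lambda_ij of definition (14)."""
    t = len(pi)
    lam = [[0.0] * t for _ in range(t)]
    for i in range(t):
        for j in range(t):
            if i != j:
                lam[i][j] = -1.0 / (pi[i] + pi[j]) ** 2
        lam[i][i] = sum(pi[j] / (pi[i] + pi[j]) ** 2
                        for j in range(t) if j != i) / pi[i]
    return lam


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    size = len(matrix)
    aug = [list(map(float, row)) + [1.0 if r == c else 0.0 for c in range(size)]
           for r, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


def dispersion_matrix(pi):
    """[sigma_ij] for i, j = 1..t: leading t x t block of the inverse of
    the bordered matrix [[lambda, 1'], [1, 0]] appearing in (23)."""
    t = len(pi)
    lam = lambda_matrix(pi)
    bordered = [lam[i] + [1.0] for i in range(t)] + [[1.0] * t + [0.0]]
    inverse = invert(bordered)
    return [row[:t] for row in inverse[:t]]


def reduced_dispersion(pi):
    """Inverse of the reduced matrix with elements
    lambda_ij - lambda_it - lambda_tj + lambda_tt, i, j = 1..t-1."""
    t = len(pi)
    lam = lambda_matrix(pi)
    last = t - 1
    reduced = [[lam[i][j] - lam[i][last] - lam[last][j] + lam[last][last]
                for j in range(last)] for i in range(last)]
    return invert(reduced)


if __name__ == "__main__":
    sigma = dispersion_matrix([0.5, 0.3, 0.2])
    print([[round(v, 5) for v in row] for row in sigma])
```

Replacing the parameter values by the estimates $p_1, \dots, p_t$ gives the sample dispersion matrix described above; dividing by $n$ gives the large-sample variances and covariances of the $p_i$ themselves.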




It has been shown (Bradley, 1953) that the functions of the parameters, $\ln \pi_1, \dots, \ln \pi_t$,
determine in a sense location points for the $t$ treatments on an arithmetic scale. Conse-
quently, there may be some interest in obtaining

$$\operatorname{cov}(\ln p_i, \ln p_j) \doteq \frac{\sigma_{ij}}{n\pi_i\pi_j} \quad (i, j = 1, \dots, t),$$ (24)

as given, for example, by Hald (1952). If common logarithms are used, a factor, $(0{\cdot}4343)^2$,
is required in the right-hand member of (24).


IV. LARGE-SAMPLE DISTRIBUTIONS OF $T$


4-1. $T$ expanded in a series


It has been shown (Bradley & Terry, 1952) that, under the conditions of the null hypo-
thesis of Case (i) of § 1-2, $T$ as defined in (5) has, asymptotically with $n$, the $\chi^2$-distribution
with $(t-1)$ degrees of freedom. In order to investigate the power of the test procedure
based on $T$ and to consider the efficiency of the method in comparison with other test pro-
cedures, we require the distribution of $T$ under the alternative hypothesis $H_1$ of Case (i).
In this section we shall show that $T$ has the same limiting distribution as a certain sum
of squares.


Let $$y_i = \frac{p_i - 1/t}{1/t} \quad (i = 1, \dots, t),$$ (25)

and note that from (3) $$a_i = \tfrac{1}{2}n(1+y_i)\sum_j{}' [1+\tfrac{1}{2}(y_i+y_j)]^{-1}$$ (26)

for all $i$, and $$\sum_i a_i = \tfrac{1}{2}nt(t-1).$$ (27)

Substitution in (5) through the use of (6) and upon reduction based on (26) and (27) yields

$$T = n\sum_i (1+y_i)\ln(1+y_i)\sum_j{}' [1+\tfrac{1}{2}(y_i+y_j)]^{-1} - 2n\sum_{i<j} \ln[1+\tfrac{1}{2}(y_i+y_j)].$$ (28)

It is next possible, by expanding the right-hand member of (28) in a power series in the
variates, to show that
$$T = \tfrac{1}{4}nt\sum_i y_i^2 + R(y_i),$$ (29)

where R(y;) depends on higher powers of the variates than the second. In turn, it can be 
shown that 


| Rty;)| < zy(5t — 8) | ys |® + 3m 9t — 14) > yj + gntt— 2)z Yi [+i lyity;|* 
i i i<j 
* , (Y¥ity ? 
+n il +4y3 +o y; |? +4y4 —t<f"___. (30 
Uyl+witslnP+ WIS gg sy yy 39) 
This result, (30), is obtained through algebraic manipulation involving the use of the general 
Pac | In (1 +2)—a+ $a?|<|4a|§, (1+)? = 1-x+22/(1+2), 


and the particular relations

$$\sum_j{}' y_j = -y_i \quad\text{and}\quad 2\sum_{i<j} y_iy_j = -\sum_i y_i^2.$$
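
The size of the remainder in the expansion (29) can be examined numerically. The sketch below evaluates $T$ from (28), with $a_i$ taken from (26), and compares it with the leading term $\tfrac{1}{4}nt\sum y_i^2$ for a small deviation vector satisfying $\sum y_i = 0$. The function names and the particular values of $y$ and $n$ are illustrative assumptions.

```python
import math


def wins_from_y(y, n):
    """a_i expressed through y_1..y_t as in equation (26)."""
    t = len(y)
    return [0.5 * n * (1 + y[i]) * sum(1.0 / (1 + 0.5 * (y[i] + y[j]))
                                       for j in range(t) if j != i)
            for i in range(t)]


def exact_T(y, n):
    """T of equation (28), written as 2*sum a_i ln(1+y_i) minus the pair sum."""
    t = len(y)
    a = wins_from_y(y, n)
    first = 2.0 * sum(a[i] * math.log(1 + y[i]) for i in range(t))
    second = 2.0 * n * sum(math.log(1 + 0.5 * (y[i] + y[j]))
                           for i in range(t) for j in range(i + 1, t))
    return first - second


def quadratic_T(y, n):
    """Leading term of (29): n t sum(y_i^2) / 4."""
    return 0.25 * n * len(y) * sum(v * v for v in y)


if __name__ == "__main__":
    y, n = [0.003, -0.001, -0.001, -0.001], 50   # deviations sum to zero
    print(exact_T(y, n), quadratic_T(y, n))
```

For deviations of this size the two values agree to within a fraction of one per cent, in line with the remainder depending only on third and higher powers of the $y_i$.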













Let us suppose that $\sqrt{n}\,y_i$ $(i = 1, \dots, t)$ have limiting distributions and this will be demon-
strated under desired conditions. Assuming this result, we may then state that $n^{\frac{1}{2}-\epsilon}y_i$, for
any $\epsilon > 0$, converges in probability to zero, for it is an easily proved theorem in probability
that, if $(X_N)$ represents a sequence of random variables with a limiting distribution function
and if $(\alpha_N)$ is a sequence of constants approaching zero as $N \to \infty$, then $(\alpha_N X_N)$ is a sequence
of random variables converging in probability to zero as $N \to \infty$. It follows from Slutsky's
theorem (Cramér, 1946, § 20-6, p. 255) that, if $\sqrt{n}\,y_i$ have limiting distributions, $R(y_i)$
converges stochastically to zero. This is sufficient (Cramér, 1946, § 20-6, p. 254) to state that
$T$ has the same limiting distribution as $\tfrac{1}{4}nt\sum_i y_i^2$.

i 


4-2. Limiting distribution of $y_i$ under $H_1$
The normal equations for the maximum-likelihood estimators were given in (3) and (4),
and these relations are useful in obtaining the limiting distribution of $\sqrt{n}\,y_i$ defined in (25)
under modified specifications of $\pi_1, \dots, \pi_t$. From (3) or (26), we have

$$a_i/n = \tfrac{1}{2}(1+y_i)\sum_j{}' [1+\tfrac{1}{2}(y_i+y_j)]^{-1} \quad (i = 1, \dots, t).$$ (31)

Expansion in (31), using the negative binomial and the result that $\sum_j{}' y_j = -y_i$, and multi-
plication by $\sqrt{n}$ lets us write

a; (t— 2) L_sltyWnytvaur ¢ 
t_ a0e_ 1) 1 ee hd Maes ele ee ee eet 450 32 

Now, from an argument similar to that used in § 4-1, if $\sqrt{n}\,y_i$ has a limiting distribution, it
has the same limiting distribution as

$$\frac{4\sqrt{n}}{t}\left[\frac{a_i}{n} - \tfrac{1}{2}(t-1)\right].$$ (33)
Let us now redefine the parameters so that
$$\pi_i = \frac{1}{t} + \frac{\delta_{in}}{\sqrt{n}} \quad (i = 1, \dots, t),$$ (34)

where $\delta_{in}$ represents a sequence of constants converging to $\delta_i$ as $n \to \infty$. This redefinition of
the parameters is an artifice that permits us to find the limiting distribution of $T$ under $H_1$
and thence the power of the test procedure. This essentially means that we are investigating
the asymptotic power for alternatives in the locality of the parameter point determined by
the null hypothesis. This is necessary in order that we do not merely show that the test
procedure is consistent. Under these conditions we require the limiting distribution of the
variate defined in (33).

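Let $\bar{x}_{ij}$ be the average over the $n$ repetitions of the binomial variate, $x_{ij}$, defined in § 3-1.
Then, through the definitions of $x_i$ and $x_{ij}$ and (13), we have

$$a_i = n\sum_j{}' \bar{x}_{ij},$$ (35)

and the variates, $\bar{x}_{i1}, \dots, \bar{x}_{it}$, omitting $\bar{x}_{ii}$, are independent.
Since $E(x_{ij}) = \pi_i(\pi_i+\pi_j)^{-1}$, we may define $\phi_n(\tau)$ to be the characteristic function of
$2\sqrt{n}\,(\bar{x}_{ij}-\tfrac{1}{2})$, and, following the method of Cramér (1946) in § 16-4 of his book, we have

$$\phi_n(\tau) = \left\{\frac{\pi_i}{\pi_i+\pi_j}\exp\left[\sqrt{-1}\,\frac{\tau}{\sqrt{n}}\right] + \frac{\pi_j}{\pi_i+\pi_j}\exp\left[-\sqrt{-1}\,\frac{\tau}{\sqrt{n}}\right]\right\}^n.$$ (36)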

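Expansion of the exponentials in (36) yields

$$\phi_n(\tau) = \left[1 + \sqrt{-1}\left(\frac{\pi_i-\pi_j}{\pi_i+\pi_j}\right)\frac{\tau}{\sqrt{n}} - \frac{\tau^2}{2n} + \left(\frac{\pi_i\psi-\pi_j\theta}{\pi_i+\pi_j}\right)\frac{\tau^3}{6n^{3/2}}\right]^n,$$ (37)
where $\psi$ and $\theta$ are real or complex quantities less than unity in modulus.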
We now substitute from (34) for the parameters in (37) and write 





2 n 
by(t) = [1+y-150 (6;- eacinke 


t = 
where H= va 1% 0,540) bly +6j)-J—1z , Jn T (a, — fly, ! Bin pia: ? 
. 6:,7% 8, ¥*) [2 oi +4;, 
P 7 3 “in? int ine 
and W= sali (63 — y®) + My, Ue v||F+ 





t Vn 
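From the convergence of $\delta_{in}$ to $\delta_i$ and the forms of $H$ and $W$, for any $\epsilon, \eta > 0$, there exists
$N_0(\epsilon, \eta)$ such that, when $n > N_0$, $|H| < \epsilon$ and $|W| < \eta$ for fixed $\tau$. It follows that

$$\exp[-\tfrac{1}{2}\tau^2(1+\eta) + \tfrac{1}{2}\sqrt{-1}\,t\tau(\delta_i-\delta_j) - \epsilon] < \lim_{n\to\infty} \phi_n(\tau) < \exp[-\tfrac{1}{2}\tau^2(1-\eta) + \tfrac{1}{2}\sqrt{-1}\,t\tau(\delta_i-\delta_j) + \epsilon],$$

and hence $$\lim_{n\to\infty} \phi_n(\tau) = \exp[-\tfrac{1}{2}\tau^2 + \tfrac{1}{2}\sqrt{-1}\,t\tau(\delta_i-\delta_j)].$$ (38)
This result (38) is sufficient (see, for example, Cramér, 1946, § 10-4) to state that $2\sqrt{n}\,(\bar{x}_{ij}-\tfrac{1}{2})$
has a normal limiting distribution with mean $\tfrac{1}{2}t(\delta_i-\delta_j)$ and unit variance.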


We may obtain the variances and covariances of $4\sqrt{n}\,[a_i/n-\tfrac12(t-1)]/t$ through use of the relation (35). It is easy to show that

$$\frac{4\sqrt{n}}{t}\left[\frac{a_i}{n}-\tfrac12(t-1)\right] = \frac{2}{t}\sum_{j}{}'\,[2\sqrt{n}\,(\bar{x}_{ij}-\tfrac12)], \tag{39}$$

and we note also that by definition

$$\bar{x}_{ij} = 1-\bar{x}_{ji}. \tag{40}$$

The mean of the left-hand member of (39) in the limit is now seen to be $\sum_j{}'(\delta_i-\delta_j) = t\delta_i$, since $\sum_j{}'\delta_j = -\delta_i$. Through the variance of $2\sqrt{n}\,(\bar{x}_{ij}-\tfrac12)$ and the independence of the variates in the sum in (39), the limiting variance of the left-hand member of (39) is $4(t-1)/t^2$. $\bar{x}_{ij}$ and $\bar{x}_{gh}$ are independent unless $i = g$ and $j = h$ or $i = h$ and $j = g$. Then in view of (40), the covariance of $4\sqrt{n}\,[a_i/n-\tfrac12(t-1)]/t$ and $4\sqrt{n}\,[a_g/n-\tfrac12(t-1)]/t$ $(i \neq g)$ is $-4/t^2$. It follows that $4\sqrt{n}\,[a_1/n-\tfrac12(t-1)]/t, \ldots, 4\sqrt{n}\,[a_t/n-\tfrac12(t-1)]/t$, and consequently $\sqrt{n}\,y_1, \ldots, \sqrt{n}\,y_t$, have a joint limiting normal distribution with means $t\delta_1, \ldots, t\delta_t$, and equal variances and covariances $4(t-1)/t^2$ and $-4/t^2$, respectively, under the conditions (34) on the parameters as $n\to\infty$. Since $\sum_i y_i = 0$, the limiting distribution is necessarily singular.
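The singularity of this limiting distribution can be seen directly from the covariance structure. The short sketch below (the function name is hypothetical) builds the matrix with $4(t-1)/t^2$ on the diagonal and $-4/t^2$ elsewhere and confirms that every row sums to zero, so that the variance of $\sum_i \sqrt{n}\,y_i$ vanishes.

```python
def limiting_covariance(t):
    """Covariance matrix of sqrt(n)*y_1, ..., sqrt(n)*y_t in the limit:
    4(t-1)/t^2 on the diagonal and -4/t^2 off the diagonal."""
    return [[4.0 * (t - 1) / t**2 if i == j else -4.0 / t**2
             for j in range(t)] for i in range(t)]

cov = limiting_covariance(4)
row_sums = [sum(row) for row in cov]
# Zero row sums mean the vector (1, ..., 1) is annihilated by the matrix,
# i.e. the distribution is confined to the hyperplane sum(y_i) = 0.
print(row_sums)
```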


It is to be noted that this same limiting distribution can apparently be obtained by using the definition (34) for the parameters and relying on the use of the joint limiting distribution of $\sqrt{n}\,(p_1-\pi_1), \ldots, \sqrt{n}\,(p_t-\pi_t)$ developed in § 3.2 above. If $\pi_i$ is replaced by $t^{-1}+n^{-1/2}\delta_{in}$ in the definitions of $\lambda_{ij}$ in (14), and if $\lambda_{ij}$ is then replaced by $\lim_{n\to\infty}\lambda_{ij}$ in (23), the correct variances and covariances would be obtained for $\sqrt{n}\,y_1, \ldots, \sqrt{n}\,y_t$. That this procedure is valid has been proved by Wald (1943) but we have obtained the results directly.













4.3. Limiting distribution of $T$ under $H_1$: $\pi_i = t^{-1}+n^{-1/2}\delta_{in}$.
We define

$$u_i = \tfrac12 t^{1/2}\sqrt{n}\,y_i, \tag{41}$$

and, from § 4.1, $T$ has the same limiting distribution as $\sum_i u_i^2$. $u_1, \ldots, u_t$ have a joint limiting distribution that is multivariate normal with means, $\tfrac12 t^{3/2}\delta_1, \ldots, \tfrac12 t^{3/2}\delta_t$, variances $(t-1)/t$ and covariances $-1/t$. Further, $\sum_i u_i = 0$.

The Helmert transformation is used to transform to new variables, $z_1, \ldots, z_t$, with $z_t = 0$, and hence $T$ has the same limiting distribution as $\sum_{i=1}^{t-1} z_i^2$. It is easily verified that $z_1, \ldots, z_{t-1}$ are independent, have unit variances, and have limiting normal distributions under $H_1$ with means, say, $\vartheta_1, \ldots, \vartheta_{t-1}$, where $\sum_{i=1}^{t-1}\vartheta_i^2 = \tfrac14 t^3\sum_i\delta_i^2$. This is sufficient to permit the final conclusion that $\sum_{i=1}^{t-1} z_i^2$ and $T$ have a limiting distribution under $H_1$: $\pi_i = t^{-1}+n^{-1/2}\delta_{in}$ as $n\to\infty$, that is, the non-central $\chi^2$-distribution with $(t-1)$ degrees of freedom and parameter of non-centrality, $\lambda = \tfrac14 t^3\sum_i\delta_i^2$. $\lambda$ is defined if we reiterate that the limiting distribution function of $T$ under $H_1$ is obtained by integrating the non-central $\chi^2$-density,

$$f(T) = e^{-\frac12 T}e^{-\frac12\lambda}\sum_{h=0}^{\infty}\frac{T^{\frac12(t-1)+h-1}\lambda^h}{2^{\frac12(t-1)+2h}\,\Gamma\{\tfrac12(t-1)+h\}\,h!}. \tag{42}$$
The power of the paired comparisons test based on $T$ is asymptotically given by

$$\beta(\lambda \mid \alpha, t-1, \infty) = \int_{\chi^2_{\alpha,\,t-1}}^{\infty} f(T)\,dT, \tag{43}$$

where $\chi^2_{\alpha,\,t-1}$ is the $\alpha$-level significance value of a central $\chi^2$-distribution with $(t-1)$ degrees of freedom. Fix (1949) provided tables for $\alpha = 0.01$ and $\alpha = 0.05$ with $\lambda$ tabulated for given values of $\beta$ and for given degrees of freedom. Pearson & Hartley (1951) provided charts for given degrees of freedom and $\alpha = 0.01$ and $0.05$, whereon $\beta$ is plotted against $\phi$. These charts are for the non-central $F$-distribution but may be used for the non-central $\chi^2$-distribution if the denominator degrees of freedom of $F$ are taken to be infinite. The notation $\beta(\lambda \mid \alpha, \nu_1, \nu_2)$ was used by Pearson & Hartley and has been modified for our special case in (43). Here

$$\phi = \sqrt{\lambda/t} = \tfrac12 t\,\Big(\sum_i\delta_i^2\Big)^{1/2}. \tag{44}$$

Tang (1938) presented tables based on $\phi$ that may also be used to evaluate the integral (43).
For large $n$, an adequate approximation to the power of the paired comparisons test may be obtained by assuming the non-central $\chi^2$-distribution and taking the parameter $\lambda$ to be

$$\lambda = \tfrac14 nt^3\sum_i(\pi_i-1/t)^2. \tag{45}$$

When $H_0$ is true, $\pi_i = 1/t$ and $\delta_{in} = \delta_i = 0$ $(i = 1, \ldots, t)$. Accordingly, $\lambda = 0$ and $T$ has then the central $\chi^2$-distribution with $(t-1)$ degrees of freedom for large $n$. This simply re-establishes the approximate large-sample distribution for the test of $H_0$ given by Bradley & Terry (1952).
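Where the tables and charts cited above are not to hand, the integral (43) can be evaluated numerically. The sketch below is one possible approach and is not the procedure used in this paper: it treats the non-central $\chi^2$-distribution as a Poisson mixture of central $\chi^2$-distributions and uses only the Python standard library. All function names are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper tail P(X > x) of a central chi-square with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Series for the regularized lower incomplete gamma function P(a, z).
    term = math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
    total, k = term, 0
    while term > 1e-16 * total:
        k += 1
        term *= z / (a + k)
        total += term
    return max(0.0, 1.0 - total)

def chi2_critical(alpha, df):
    """Value c with P(X > c) = alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_sf(hi, df) > alpha:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def asymptotic_power(lam, alpha, df):
    """Integral (43): P(non-central chi-square(df, lam) > critical value)."""
    c = chi2_critical(alpha, df)
    weight = math.exp(-lam / 2.0)      # Poisson(lam/2) weight for h = 0
    power, h = 0.0, 0
    while h < 1000:
        power += weight * chi2_sf(c, df + 2 * h)
        h += 1
        weight *= (lam / 2.0) / h
        if weight < 1e-15 and h > lam / 2.0:
            break
    return power

# Example call with t = 3 treatments, so df = t - 1 = 2.
print(asymptotic_power(7.43, 0.05, 2))
```

With $\lambda = 0$ the function returns the significance level $\alpha$, which is a convenient check on the implementation.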













V. COMPARISONS OF POWERS OF TESTS 


5.1. Comparison of the test for paired comparisons with a multi-binomial test
One of the chief uses of an investigation of the power of a statistical test is an assessment of the merits of the test in contrast with alternative procedures. In this and the following section we shall compare the test based on $T$ with two other possible methods.

The model formulated for the method of paired comparisons is not the most general one of its form possible. We could postulate parameters $\pi_{ij}$, the probability of treatment $i$ being rated above treatment $j$ in the comparison of these two treatments. Then the $n$ repetitions of the paired comparisons design could be regarded as $\tfrac12 t(t-1)$ unrelated binomial tests, each test depending on $n$ trials.

Let us consider the comparison of treatment $i$ with treatment $j$. The usual test of the null hypothesis, $\pi_{ij} = \tfrac12$, is based on the statistic,

$$S_{ij} = 4n(p_{ij}-\tfrac12)^2, \tag{46}$$

where $p_{ij}$ is the estimator of $\pi_{ij}$. Further,

$$S_{ij} = 2n(p_{ij}-\tfrac12)^2 + 2n(p_{ji}-\tfrac12)^2 = u_{ij}^2+u_{ji}^2,$$

and $u_{ij}$ and $u_{ji}$ correspond to variates $u_i$ defined in (41). This binomial test is only a special case, $t = 2$, of the paired comparisons method being discussed in this paper. Then, under the null hypothesis, $S_{ij}$ has a limiting $\chi^2$-distribution with 1 degree of freedom. (This result is of course well known.) Further, if the alternative hypothesis is expressed as

$$\pi_{ij} = \tfrac12 + n^{-1/2}\mu_{ijn},$$

with $\mu_{ijn}$ converging to $\mu_{ij}$ as $n\to\infty$, $S_{ij}$ has a limiting non-central $\chi^2$-distribution with 1 degree of freedom and parameter of non-centrality, $\lambda_{ij} = 2(\mu_{ij}^2+\mu_{ji}^2)$, where $\pi_{ji} = 1-\pi_{ij}$ and $\mu_{ij} = -\mu_{ji}$. The distribution of $S_{ij}$ under the alternative hypothesis follows as a special case of the theory developed for paired comparisons.


Suppose that the model for paired comparisons is appropriate. Then, $\pi_{ij} = \pi_i/(\pi_i+\pi_j)$ and, if $\pi_i = t^{-1}+n^{-1/2}\delta_{in}$ as before,

$$\mu_{ijn} = \tfrac14 t(\delta_{in}-\delta_{jn})\,[1+\tfrac12 t\,n^{-1/2}(\delta_{in}+\delta_{jn})]^{-1}.$$

It follows at once that

$$\mu_{ij} = \tfrac14 t(\delta_i-\delta_j), \tag{47}$$

and

$$\lambda_{ij} = \tfrac14 t^2(\delta_i-\delta_j)^2. \tag{48}$$


If an overall test of $H_0$, the equality of all treatment parameters, here called the multi-binomial test, were made on the basis of the $\tfrac12 t(t-1)$ independent sets of binomial trials, the appropriate statistic would be

$$S = \sum_{i<j} S_{ij}. \tag{49}$$

From the foregoing argument and the additive property of the $\chi^2$-distribution, under $H_0$, $S$ has the central $\chi^2$-distribution with $\tfrac12 t(t-1)$ degrees of freedom and, under $H_1$, $S$ has the non-central $\chi^2$-distribution with the same degrees of freedom and parameter of non-centrality,

$$\lambda' = \tfrac14 t^2\sum_{i<j}(\delta_i-\delta_j)^2 = \tfrac14 t^3\sum_i\delta_i^2 = \lambda, \tag{50}$$

in the limit with $n$. The simplification in (50) is possible since $\sum_i\delta_i = 0$.
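The simplification in (50) rests on the identity $\sum_{i<j}(\delta_i-\delta_j)^2 = t\sum_i\delta_i^2$ when $\sum_i\delta_i = 0$. A minimal numerical check is sketched below; the function names and the values of $\delta_i$ are hypothetical and chosen only so that they sum to zero.

```python
def lambda_paired(t, deltas):
    """Non-centrality of the paired comparisons test: (1/4) t^3 sum(delta_i^2)."""
    return 0.25 * t**3 * sum(d * d for d in deltas)

def lambda_multibinomial(t, deltas):
    """Non-centrality of the multi-binomial test:
    (1/4) t^2 * sum over i<j of (delta_i - delta_j)^2."""
    return 0.25 * t**2 * sum((deltas[i] - deltas[j]) ** 2
                             for i in range(t) for j in range(i + 1, t))

deltas = [0.3, -0.5, 0.2]     # hypothetical values summing to zero
print(lambda_paired(3, deltas), lambda_multibinomial(3, deltas))
```

The two printed values coincide, so the two tests differ asymptotically only in their degrees of freedom.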







The paired comparisons test and the multi-binomial test under the conditions appropriate to the former have asymptotic power functions that differ only in the numbers of degrees of freedom that are available. The asymptotic powers of the two tests are clearly different. In Figs. 1 and 2 we have plotted the power $\beta$ against $\lambda$ for $t = 3$ and $t = 4$ for both tests based on both 0.05 and 0.01 levels of significance. The $\beta$-scale is logarithmic and the power curves were obtained from similar charts given by Pearson & Hartley (1951). The horizontal scales on our charts are for $\lambda$, the element common to the parameters of the tests considered. To use the Pearson & Hartley charts we computed $\phi = \sqrt{\lambda/t}$ for the paired comparisons test as given in (44) and we required

$$\phi' = \sqrt{\frac{\lambda}{\tfrac12 t(t-1)+1}} \tag{51}$$

for the multi-binomial test. It is to be noted that, when $\lambda = 0$, $\beta = 0.05$ or $0.01$ for both tests. Further, as one would expect, $\beta\to 1$ as $\lambda\to\infty$ for both tests. In the intervening $\lambda$-region, the $T$-test is more powerful than the multi-binomial test, and there is an indication that the advantage of the paired comparisons test increases sharply with $t$, the number of treatments.


Power curves for an analysis-of-variance test are also given in Figs. 1 and 2. The com- 
parison of paired comparisons with analysis of variance is discussed in the next section. 


5.2. Comparison of the test for paired comparisons with the analysis of variance


A correspondence between the parameters of the method of paired comparisons and those of analysis of variance may be established by referring to the discussion of Thurstone's model for paired comparisons given by Mosteller (1951). Mosteller summarizes the Thurstone model by writing*

$$P(X_i > X_j) = \frac{1}{\sqrt{2\pi}}\int_{-(S_i-S_j)/\sigma_d}^{\infty} e^{-\frac12 y^2}\,dy, \tag{52}$$

where $S_i$ is the location point for the variate $X_i$ on a subjective continuum and $\sigma_d^2$ is the variance of the difference $(X_i-X_j)$. It was assumed that the variance of $X_i$ is $\sigma^2$ and the covariance of $X_i$ and $X_j$ is $\rho\sigma^2$. Then $\sigma_d^2 = 2\sigma^2(1-\rho)$. If $X_i$ and $X_j$ can be observed in paired comparisons, the usual additive model of the analysis of variance would require that $\rho = 0$, $S_i$ be the effect of treatment $i$, and $\sigma^2$ be the common variance of the random elements of that additive model. For our purposes, the Thurstone model may more generally be associated with any analysis of variance with the same variance $\sigma^2$ for the random element of its additive model and with $S_i$ as the effect of treatment $i$.

The present author (1953) has shown that, for the method of paired comparisons here discussed, one can write

$$P(X_i > X_j) = \frac{\pi_i}{\pi_i+\pi_j} = \frac14\int_{-(\ln\pi_i-\ln\pi_j)}^{\infty}\operatorname{sech}^2\tfrac12 y\,dy = \frac{1}{4a}\int_{-a(\ln\pi_i-\ln\pi_j)}^{\infty}\operatorname{sech}^2\frac{y}{2a}\,dy. \tag{53}$$

A correspondence between $S_i$ and $\pi_i$ is required to compare the method of paired comparisons with the analysis of variance. That the forms of the integrand in (52) and (53) do not matter for large-sample sizes can be demonstrated.


* Mosteller indicated in his discussion that no generality was lost by taking $\sigma_d = 1$ and he wrote (52) with that restriction.








Consider

$$P(X_i > X_j) = \int_{-c}^{\infty} f(y)\,dy.$$

The variance of an estimate $\hat{c}$ of $c$, obtained from an estimate $p$ of $P(X_i > X_j)$, can for large samples be written*

$$V(\hat{c}) = \frac{1}{n}\,P(X_i > X_j)\{1-P(X_i > X_j)\}\{f(-c)\}^{-2}.$$

[Fig. 1. Asymptotic power functions: t = 3. Power β (vertical axis, 0.10 to 0.99) plotted against λ (horizontal axis, 1 to 29) for the tests T, paired comparisons; S, multi-binomial; F, analysis of variance; curves for α = 0.05 and α = 0.01.]

[Fig. 2. Asymptotic power functions: t = 4. Power β (vertical axis, 0.10 to 0.99) plotted against λ (horizontal axis, 1 to 29) for the tests T, paired comparisons; S, multi-binomial; F, analysis of variance; curves for α = 0.05 and α = 0.01.]

* For a derivation when $f(y)$ is normal, see Cochran (1937). The result is used by Dixon & Mood (1946) in discussing the statistical sign test.



$V(\hat{c})$ does not depend on the form of $f(y)$ but only on the ordinate $f(-c)$. Accordingly, we establish a correspondence between $S_i$ and $\pi_i$ when we adjust the scale factor, $a$, in (53) so that the ordinate is equal to $1/\sqrt{2\pi}$. The ordinate in both (52) and (53) is evaluated at $y = 0$ since $\ln\pi_i-\ln\pi_j = O(n^{-1/2})$ because of the definition, $\pi_i = t^{-1}+n^{-1/2}\delta_{in}$. It follows that

$$a = \sqrt{\tfrac18\pi}, \tag{54}$$

and $\sqrt{\tfrac18\pi}\,\ln\pi_i$ corresponds to $S_i/\sigma_d$.
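Two facts used here can be checked numerically: the $\operatorname{sech}^2$ integral in (53) reproduces $\pi_i/(\pi_i+\pi_j)$ for any scale factor $a$, and the choice $a = \sqrt{\pi/8}$ makes the ordinate at $y = 0$ equal to $1/\sqrt{2\pi}$. The following sketch uses a simple midpoint rule; the function name, step count and truncation point are assumptions made for illustration.

```python
import math

def preference_probability(pi_i, pi_j, a=1.0, steps=20000, upper=60.0):
    """Midpoint-rule value of (1/(4a)) * integral of sech^2(y/(2a)) over
    (-a*c, upper*a), with c = ln(pi_i) - ln(pi_j).
    The tail beyond upper*a is negligible."""
    c = math.log(pi_i) - math.log(pi_j)
    lower, top = -a * c, upper * a
    h = (top - lower) / steps
    total = 0.0
    for k in range(steps):
        y = lower + (k + 0.5) * h
        total += 1.0 / math.cosh(y / (2.0 * a)) ** 2
    return total * h / (4.0 * a)

a = math.sqrt(math.pi / 8.0)                     # scale factor (54)
print(preference_probability(0.43, 0.24))        # compare with 0.43 / (0.43 + 0.24)
print(preference_probability(0.43, 0.24, a=a))   # unchanged by the scale factor
print(1.0 / (4.0 * a), 1.0 / math.sqrt(2.0 * math.pi))  # ordinates at y = 0
```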

If in the Thurstone model we assume that $\rho = 0$ as required for the association with analysis of variance, $\sqrt{2}\,S_i/\sigma_d$ represents the location point (or treatment effect) for the $i$th treatment on a subjective continuum (or additive scale) whereon the variance of an observation is unity. In terms of the parameters of the method of paired comparisons, the location point is $\sqrt{\tfrac14\pi}\,\ln\pi_i$. If $n(t-1)$ observations are made on each variate $X_i$ in the analysis of variance (this corresponds to the number of times a treatment appears in the method of paired comparisons), the parameter of non-centrality for the $F$-test would be approximately

$$\lambda_F = \tfrac14\pi n(t-1)\sum_i\Big(\ln\pi_i-\frac1t\sum_j\ln\pi_j\Big)^2. \tag{55}$$

For large samples the non-central $F$-distribution approaches the distribution of the non-central $\chi^2$ with $(t-1)$ degrees of freedom and $\lambda_F\to\lambda''$, where

$$\lambda'' = \tfrac14\pi t^2(t-1)\sum_i\delta_i^2 = \pi(t-1)\lambda/t, \tag{56}$$

in view of (34) and (50).

The power functions for analysis of variance are plotted in Figs. 1 and 2. To plot these curves, we computed $\phi'' = \sqrt{\pi(t-1)\lambda/t^2}$ in order to use the Pearson & Hartley charts. It is clear that analysis of variance is superior to paired comparisons and the multi-binomial procedure for $t = 3$ and $4$ and the advantage increases with $t$. On the other hand, for comparative purposes, we have assumed that conditions for the valid application of analysis of variance could be attained while the method of paired comparisons was devised for situations where the analysis of variance may not be used.
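The three chart arguments can be computed together for a common $\lambda$. The sketch below is illustrative only. It assumes the Pearson & Hartley convention $\phi = \sqrt{\lambda/(\nu_1+1)}$ for the multi-binomial test, with $\nu_1 = \tfrac12 t(t-1)$; the function name is hypothetical.

```python
import math

def chart_arguments(lam, t):
    """Return (phi, phi', phi'') for entering the Pearson & Hartley charts."""
    phi_T = math.sqrt(lam / t)                           # (44), nu_1 = t - 1
    # Assumed form for the multi-binomial test, nu_1 = t(t-1)/2.
    phi_S = math.sqrt(lam / (0.5 * t * (t - 1) + 1.0))
    lam_F = math.pi * (t - 1) * lam / t                  # (56)
    phi_F = math.sqrt(lam_F / t)                         # sqrt(pi (t-1) lam / t^2)
    return phi_T, phi_S, phi_F

print(chart_arguments(9.0, 3))
```

For fixed $\lambda$ and $t > 2$, the analysis of variance has the largest argument and the multi-binomial test the smallest, which matches the ordering of the curves described in the text.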


5.3. Relative efficiencies

The relative efficiency of one test to another may be obtained from the limiting ratio of sample sizes required for equal powers. The concept depends on local properties of the power functions of the two tests being compared. Relative efficiency is the limit of the inverse ratio of sample sizes required in two tests in order that the tests have equal powers for alternatives approaching, with increasing sample sizes, parameter values specified by the null hypothesis. The methods employed in this section parallel those developed by Pitman (1948). We shall first obtain the asymptotic relative efficiency of paired comparisons to analysis of variance by elementary means. Then, to complete the inter-comparisons, we use a more general theorem of Pitman, as presented and extended by Noether (1955), to obtain the relative efficiencies of the multi-binomial procedure to paired comparisons and to the analysis of variance.

Let $n''$ be the sample size for the analysis of variance and $n$ the sample size for paired comparisons. For comparable treatment differences with the ratio $n''/n$ fixed, we need $\delta_{in''}$ corresponding to $\delta_{in}$ of (34) such that


$$\lim_{n\to\infty}\left\{\frac{\delta_{in''}}{\sqrt{n''}}\right\}\Big/\left\{\frac{\delta_{in}}{\sqrt{n}}\right\} = 1. \tag{57}$$











Now we have already seen (§ 4.3) that for the method of paired comparisons,

$$\lambda = \tfrac14 t^3\sum_i\delta_i^2 = \lim_{n\to\infty}\tfrac14 nt^3\sum_i\delta_{in}^2/n,$$

and we can write, from (56),

$$\lambda'' = \lim_{n''\to\infty}\tfrac14\pi n''t^2(t-1)\sum_i\delta_{in''}^2/n''.$$

The test situations are comparable when (57) holds and the asymptotic powers are identical when $\lambda = \lambda''$. Consequently the ratio of sample sizes for equal powers is asymptotically

$$\lim_{n\to\infty}\frac{n''}{n} = \frac{t}{\pi(t-1)}, \tag{58}$$

and this is the asymptotic relative efficiency of the method of paired comparisons to the analysis of variance.

The theorems of Pitman and Noether yield relative efficiencies in terms of the efficacies of the tests being compared. For a statistic $T_N$ based on a sample of size $N$, define

$$E(T_N) = \psi_N(\theta; T_N) \quad\text{and}\quad \operatorname{var}(T_N) = \sigma_N^2(\theta; T_N).$$

The efficacy of the test of $H_0$: $\theta = \theta_0$ against the alternative $H$: $\theta > \theta_0$ is

$$R_N^{1/(m\delta)}(\theta_0; T_N) = \{\psi_N^{(m)}(\theta_0; T_N)/\sigma_N(\theta_0; T_N)\}^{1/(m\delta)}, \tag{59}$$

where $\theta_N = \theta_0+k/N^{\delta}$. It is assumed that a particular alternative $\theta = \theta_N$ changes with the sample size $N$ in such a way that $\lim_{N\to\infty}\theta_N = \theta_0$.


We have three tests of the hypothesis $H_0$: $\pi_i = 1/t$ with the alternative,

$$H_1:\ \pi_i = t^{-1}+n^{-1/2}\delta_{in}, \quad \delta_{in}\to\delta_i \text{ as } n\to\infty \quad (i = 1, \ldots, t).$$

The statistics, $T_n$ for the method of paired comparisons, $S_n$ for the multi-binomial procedure, and $F_n$ for the analysis of variance, all have asymptotically non-central $\chi^2$-distributions with parameters,

$$\lambda = \lim_{n\to\infty}\tfrac14 nt^3\theta_n, \quad \lambda' = \lim_{n\to\infty}\tfrac14 nt^3\theta_n, \quad \lambda'' = \lim_{n\to\infty}\tfrac14\pi nt^2(t-1)\theta_n, \tag{60}$$

and degrees of freedom, $(t-1)$, $\tfrac12 t(t-1)$, $(t-1)$ respectively. In (60)

$$\theta_n = \sum_i\delta_{in}^2/n. \tag{61}$$

The subscripts were added to the statistics $T$, $S$ and $F$ in order to provide a notation consistent with the general definition (59) and to emphasize the dependence of the statistics on $n$.

The hypothesis $H_0$ is true when $\theta_n = \theta_0 = 0$. We may take $\psi_n(\theta; T_n)$, $\psi_n(\theta; S_n)$ and $\psi_n(\theta; F_n)$, corresponding to $\psi_N(\theta; T_N)$ in (59), respectively to be*

$$(t-1)+\tfrac14 nt^3\theta, \quad \tfrac12 t(t-1)+\tfrac14 nt^3\theta \quad\text{and}\quad (t-1)+\tfrac14\pi nt^2(t-1)\theta.$$

Differentiation with respect to $\theta$ yields (for all values of $\theta$ including $\theta = 0$)

$$\psi_n'(0; T_n) = \tfrac14 nt^3, \quad \psi_n'(0; S_n) = \tfrac14 nt^3 \quad\text{and}\quad \psi_n'(0; F_n) = \tfrac14\pi nt^2(t-1). \tag{62}$$
* Our values of $\psi_n$ and $\sigma_n^2$, the latter given in (63), are not the means and variances of the statistics for finite $n$ but rather were calculated from the limiting distributions. That this is permissible under conditions which may be verified for our tests was noted by Noether.








From the non-central $\chi^2$-distributions,

$$\sigma_n^2(0; T_n) = 2(t-1), \quad \sigma_n^2(0; S_n) = t(t-1) \quad\text{and}\quad \sigma_n^2(0; F_n) = 2(t-1). \tag{63}$$

The efficacies of the three tests are, with $\delta = 1$, $m = 1$ in (59) in each case,

$$R_n(0; T_n) = \frac{nt^3}{\sqrt{32(t-1)}}, \quad R_n(0; S_n) = \frac{nt^2}{4}\sqrt{\frac{t}{t-1}} \quad\text{and}\quad R_n(0; F_n) = \pi nt^2\sqrt{\frac{t-1}{32}}. \tag{64}$$

The relative efficiencies of the three tests taken in pairs are given by the limits of the ratios of corresponding values of $R_n$, subject to conditions set forth in the reference and which may be verified here. The relative efficiencies of $S$ to $T$ and to $F$ are

$$\text{R.E. } (S \text{ to } T) = \lim_{n\to\infty}\frac{R_n(0; S_n)}{R_n(0; T_n)} = \left(\frac{2}{t}\right)^{1/2} \tag{65}$$

and

$$\text{R.E. } (S \text{ to } F) = \lim_{n\to\infty}\frac{R_n(0; S_n)}{R_n(0; F_n)} = \frac{(2t)^{1/2}}{\pi(t-1)}. \tag{66}$$


Table 1. Asymptotic relative efficiencies

(T, paired comparisons; S, multi-binomial; F, analysis of variance)

  t          2      3      4      5      6      7      8      9     10      ∞
  T to F    63.7   47.7   42.4   39.8   38.2   37.1   36.4   35.8   35.4   31.8
  S to T   100.0   81.7   70.7   63.2   57.7   53.5   50.0   47.1   44.7    0
  S to F    63.7   39.0   29.9   25.1   22.0   19.7   18.1   16.9   15.9    0









































The result obtained more directly for the relative efficiency of $T$ to $F$ may also be obtained using values given in (64). Asymptotic relative efficiencies expressed as percentages for values of $t$ from 2 to 10 are given in Table 1.
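The entries of Table 1 follow from (58), (65) and (66). The sketch below tabulates the three expressions as percentages; the function names are hypothetical, and small differences in the last digit from the tabled $S$ to $F$ values may arise from rounding.

```python
import math

def re_T_to_F(t):
    """Relative efficiency of paired comparisons to analysis of variance, (58)."""
    return t / (math.pi * (t - 1))

def re_S_to_T(t):
    """Relative efficiency of multi-binomial to paired comparisons, (65)."""
    return math.sqrt(2.0 / t)

def re_S_to_F(t):
    """Relative efficiency of multi-binomial to analysis of variance, (66)."""
    return math.sqrt(2.0 * t) / (math.pi * (t - 1))

for t in range(2, 11):
    print(t,
          round(100 * re_T_to_F(t), 1),
          round(100 * re_S_to_T(t), 1),
          round(100 * re_S_to_F(t), 1))
```

Since (66) is the product of (58) and (65), the third column can also serve as a check on the first two.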

When $t = 2$, the method of paired comparisons reduces to the statistical sign test and the relative efficiency (58) is $2/\pi$ as obtained by Dixon & Mood. The method of paired comparisons becomes more and more inefficient relative to the analysis of variance as $t$ increases. This was to be expected in view of the work of Dixon & Mood for the sign test, and since the comparison was made for a situation in which the analysis of variance could appropriately be used. The assumptions of the analysis of variance are invalidated in much of the work on sensory difference and subjective testing, and it was to avoid those restrictive assumptions that the method of paired comparisons was developed. Further, although for $t = 2$, the tabled relative efficiency of $T$ to $F$ is only 63.7 %, Walsh (1946) has shown that, for samples of size $n$ = 4, 5 and 6, the sign test is approximately 95 % as efficient as the 'Student' $t$-test, and, although the relative efficiency decreases as $n$ increases, it is approximately 75 % when $n = 13$. Dixon & Mood (1946) and Dixon (1953) also indicate that the efficiency of the sign test is better for smaller sample sizes. It seems safe to assume that this situation also holds for $t > 2$ in paired comparisons.

The multi-binomial procedure is seen to have rapidly decreasing relative efficiencies as 
t increases in comparison with both the method of paired comparisons and the analysis 
of variance. 













VI. ILLUSTRATIVE EXAMPLES 


6.1. Estimated variances and covariances in a numerical example


We shall use the data from the taste-testing experiment on the flavour characteristics of pork roasts as given as an example by Bradley & Terry (1952). We now assume that the data for the two judges may be pooled. The pork roasts were obtained from hogs fed corn rations and peanut supplements C, Cp and CP. The experiment, with the results of the two judges pooled, yielded the set of sums of ranks, 32, 28 and 30, respectively, for the rations while $n = 10$. The use of the tables in that paper showed the following values:


   r_1   r_2   r_3     p_1     p_2     p_3     Prob.
   32    28    30      0.24    0.43    0.32    0.63


This result is clearly non-significant at any realistic level of significance and is not indicative 
of any real ration effect on the flavours of the roasts. 

We now obtain estimates of the variances and covariances of $\sqrt{n}\,p_1$, $\sqrt{n}\,p_2$ and $\sqrt{n}\,p_3$ or $\sqrt{n}\,(p_1-\pi_1)$, $\sqrt{n}\,(p_2-\pi_2)$ and $\sqrt{n}\,(p_3-\pi_3)$. The first computing step is to obtain values of $\lambda_{ij}$ from substitution of values of $p_i$ for $\pi_i$ in (14). Then,


    λ_11 =  8.243,    λ_22 =  2.566,
    λ_12 = -2.228,    λ_23 = -1.778,
    λ_13 = -3.189,    λ_33 =  4.780.


The estimate of the determinant in the denominator of (23) is

$$\begin{vmatrix} 8.243 & -2.228 & -3.189 & 1\\ -2.228 & 2.566 & -1.778 & 1\\ -3.189 & -1.778 & 4.780 & 1\\ 1 & 1 & 1 & 0\end{vmatrix} = -154.976.$$

Now, for example, from (23),

$$\hat{\sigma}_{12} = \frac{1}{154.976}\begin{vmatrix} -2.228 & -1.778 & 1\\ -3.189 & 4.780 & 1\\ 1 & 1 & 0\end{vmatrix} = -0.0485,$$

and similarly, the complete set of estimated variances and covariances is

    σ̂_11 =  0.0703,    σ̂_22 =  0.1252,
    σ̂_12 = -0.0485,    σ̂_23 = -0.0767,
    σ̂_13 = -0.0218,    σ̂_33 =  0.0985.

The estimated variances and covariances of $p_1$, $p_2$ and $p_3$ may be obtained by dividing those immediately above by $n = 10$. Consequently, the estimated standard errors of $p_1$, $p_2$ and $p_3$ are respectively given approximately by 0.084, 0.112 and 0.099.
A check on the computing may be made by computing the variance of $\sqrt{n}\sum_i p_i$ in terms of the variances and covariances of the elements of this sum. The result of course should be zero.
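The computation in this section can be reproduced with a short program. The sketch below rests on two assumptions that are inferred from the numerical values above rather than quoted from (14) and (23): that $\lambda_{ii} = \pi_i^{-1}\sum_{j\neq i}\pi_j/(\pi_i+\pi_j)^2$ and $\lambda_{ij} = -1/(\pi_i+\pi_j)^2$, and that the ratio of cofactor to bordered determinant in (23) equals the corresponding element of the inverse of the bordered matrix. All function names are hypothetical.

```python
def lambda_matrix(p):
    """Matrix of lambda_ij evaluated at the estimates p (assumed form of (14))."""
    t = len(p)
    lam = [[0.0] * t for _ in range(t)]
    for i in range(t):
        for j in range(t):
            if i != j:
                lam[i][j] = -1.0 / (p[i] + p[j]) ** 2
                lam[i][i] += p[j] / (p[i] * (p[i] + p[j]) ** 2)
    return lam

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def estimated_covariance(p):
    """Estimated covariance matrix of sqrt(n)*p_1, ..., sqrt(n)*p_t:
    the leading t x t block of the inverse of the bordered lambda matrix."""
    t = len(p)
    lam = lambda_matrix(p)
    bordered = [lam[i] + [1.0] for i in range(t)] + [[1.0] * t + [0.0]]
    inverse = invert(bordered)
    return [row[:t] for row in inverse[:t]]

p, n = (0.24, 0.43, 0.32), 10
sigma = estimated_covariance(p)
standard_errors = [(sigma[i][i] / n) ** 0.5 for i in range(len(p))]
print([[round(v, 4) for v in row] for row in sigma])
print([round(s, 3) for s in standard_errors])
print([round(sum(row), 10) for row in sigma])   # row sums: the check of the text
```

The last line carries out the check suggested above: each row of the estimated covariance matrix sums to zero, so the variance of $\sqrt{n}\sum_i p_i$ vanishes.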










6.2. Use of the power function
(i) Suppose that the true values of the parameters $\pi_1$, $\pi_2$ and $\pi_3$ in the example of the preceding section are respectively 0.28, 0.59 and 0.13. (If these were estimates with $n = 10$, the significance level of the test would be 0.04.) We use $\delta_{in}$ defined in (34) as an approximation to $\delta_i$ and from (45) and (44) we have the approximate values


    λ = 7.43  and  φ = 1.57.


The approximate power of a 0.05-level test, obtained from Tang's tables* with $f_1 = 2$, $f_2 = \infty$, $\alpha = 0.05$, is $\beta(7.43 \mid 0.05, 2, \infty) = 0.67$ and with $\alpha = 0.01$, $\beta(7.43 \mid 0.01, 2, \infty) = 0.44$ in the notation of (43). If we are interested in the probability of failing to reject $H_0$ when $H_1$ is true, we require $1-\beta$. For $\alpha = 0.05$, the probability of a type II error is 0.33 and, for $\alpha = 0.01$, the probability of a type II error is 0.56.

(ii) Suppose that the differences observed in the estimates for the pork roast experiment are of practical importance (as distinct from statistical significance) and that for a 0.05-level test we desire the sample size for a second follow-up experiment such that $\beta = 0.95$ or such that the probability of a type II error will be about 0.05. Without specifying $n$, we take the values of $\pi_1$, $\pi_2$ and $\pi_3$ to be 0.24, 0.43 and 0.32 as estimated in the experiment. Then from (45), $\lambda = 0.123n$ and $\phi = \sqrt{0.123n/3}$.

To determine $n$, we enter Tang's tables with $f_1 = 2$, $f_2 = \infty$, $\alpha = 0.05$ and find that $\phi$ must have the value 2.35 for $\beta(0.123n \mid 0.05, 2, \infty) = 0.95$. The number of repetitions required is obtained by setting $0.123n/3 = (2.35)^2$, whence $n = 135$.


This is of course a very large number of repetitions and is required owing to the small 
differences among the estimates of the experiment. 
If we take the hypothetical case considered in (i) above and the same test requirements,


    0.743n/3 = (2.35)²  and  n = 22.


This value may again seem high but it must be emphasized that most experiments are 
conducted without regard to the powers of the tests employed and this is true even when the 
analysis of variance is used. 
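The arithmetic of this section can be collected in a few lines. The sketch below evaluates (45) per repetition and solves $\lambda/t = \phi^2$ for $n$. The function names are hypothetical, and the target value $\phi = 2.35$ is simply the tabular value quoted in the text, not something the code derives.

```python
def noncentrality_per_repetition(pis):
    """lambda / n from (45): (1/4) t^3 * sum((pi_i - 1/t)^2)."""
    t = len(pis)
    return 0.25 * t**3 * sum((p - 1.0 / t) ** 2 for p in pis)

def repetitions_required(pis, phi_target):
    """Solve lambda / t = phi_target^2 for n, with lambda proportional to n."""
    t = len(pis)
    return t * phi_target**2 / noncentrality_per_repetition(pis)

case_i = (0.28, 0.59, 0.13)     # hypothetical true values of example (i)
case_ii = (0.24, 0.43, 0.32)    # estimates from the pork roast experiment
print(10 * noncentrality_per_repetition(case_i))     # lambda for n = 10
print(round(repetitions_required(case_ii, 2.35)))    # repetitions for example (ii)
print(round(repetitions_required(case_i, 2.35)))     # repetitions for case (i)
```

The required number of repetitions is rounded to the nearest integer here, as in the text; rounding upward would be the more conservative choice.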


VII. DISCUSSION AND SUMMARY
7.1. Additional test procedures
Two test procedures relating to paired comparisons were discussed by Bradley & Terry (1952) and have not yet been considered in this paper.
The first test is a combined test on the equality of treatment effects with the specification of the alternative hypothesis such that the parameters may differ from one group of repetitions to another. Suppose the paired comparisons experiment is performed in $g$ groups of repetitions with $n_u$ repetitions in the $u$th group, $\sum_{u=1}^{g} n_u = n$, and suppose that the treatments have parameters, $\pi_{1u}, \ldots, \pi_{tu}$, $\sum_i\pi_{iu} = 1$, in the $u$th group. The hypotheses are

$$H_0:\ \pi_{iu} = 1/t \quad (i = 1, \ldots, t;\ u = 1, \ldots, g)$$


* Tang (1938) tabulated $P_{II}$, the probability of a type II error. In using the tables, linear interpolation was judged to be satisfactory for our examples.













and $H_1$: No $\pi_{iu}$ assumed equal to any other $\pi_{jv}$ ($i \neq j$ or $u \neq v$; $i, j = 1, \ldots, t$; $u, v = 1, \ldots, g$).
Let $T_u$ be the statistic corresponding to $T$ for the $u$th group. Then

$$T_c = \sum_{u=1}^{g} T_u$$

is the appropriate statistic for this test. From the additive property of $\chi^2$, it follows that under $H_0$, $T_c$ has, for large $n$ and fixed ratios $n_u/n$, the central $\chi^2$-distribution with $g(t-1)$ degrees of freedom and, under $H_1$, $T_c$ has the non-central $\chi^2$-distribution with $g(t-1)$ degrees of freedom and parameter of non-centrality approximated by

$$\lambda_c = \tfrac14 t^3\sum_i\sum_{u=1}^{g} n_u\Big(\pi_{iu}-\frac1t\Big)^2.$$
The second test is a test of agreement of ranking from group to group. The hypotheses are $H_0$: $\pi_{iu} = \pi_i$ ($i = 1, \ldots, t$; $u = 1, \ldots, g$) and $H_1$ as specified in the preceding paragraph. Let $T_p$ be the $T$-statistic obtained by pooling all groups of repetitions as though we have a homogeneous set of $n$ repetitions. The appropriate statistic for this second test is $(T_c-T_p)$. This statistic has the central $\chi^2$-distribution with $(g-1)(t-1)$ degrees of freedom under $H_0$. Under $H_1$, we state without proof that the distribution of $(T_c-T_p)$ is the non-central $\chi^2$-distribution with $(g-1)(t-1)$ degrees of freedom and parameter of non-centrality approximated by

$$\lambda_a = \tfrac14 t^3\sum_i\sum_{u=1}^{g} n_u(\pi_{iu}-\pi_i)^2.$$
7.2. Discussion
The results obtained in this paper are valid only for large samples. For small samples they may be approximately correct, but the difficulty of obtaining exact powers makes a check very nearly impossible. Under $H_0$ some information is available on the approach of the distribution of $T$ to the central $\chi^2$-distribution. The means and variances of $T$ were computed from the exact small-sample distribution of $T$ (Bradley, 1954b) and their approach to the corresponding moments of $\chi^2$ noted. It appeared that the approximate tests would be adequate at least for $n > 10$.
In extreme cases, the estimators $p_1, \ldots, p_t$ sometimes have the set of values 1, 0, ..., 0. This prohibits the computation of $\lambda_{ij}$ and the estimation of variances and covariances of the estimators. This situation was previously discussed (Bradley, 1954b), and it was noted that the first treatment could be eliminated from the analysis and that secondary parameter estimates could be obtained for $\pi_2, \ldots, \pi_t$. This procedure again seems suitable when the specified set of values occurs. Then it will be possible to proceed with the analysis and the estimation of variances and covariances for the remaining estimates. Other extreme sets of sums of ranks yield extreme sets of estimators and methods of dealing with them are discussed in the cited reference.

The comparisons of the properties of the method of paired comparisons and of other test 
methods produce results that seem to be about what one would expect. The method of 
paired comparisons is clearly superior to the multi-binomial procedure both on the basis 
of the asymptotic powers of the tests for equal sample sizes and the study of their relative 
efficiency. It appears that the superiority of paired comparisons will become more marked 
as the number of treatments is increased. The comparison of the method of paired com- 
parisons with the analysis of variance is less favourable to the method of paired comparisons. 













Here the comparison was made when the analysis of variance would be the appropriate 
method. It should be remembered that the method of paired comparisons was formulated 
largely for subjective tests and it is there that the assumptions of the analysis of variance 
are seriously suspect. Further, even if a scoring technique could be devised for which the 
analysis of variance could be used, it is quite possible that an over-all consideration of time 
of judging and analysis would dictate the use of more samples and the non-parametric 
method. The large-sample results indicate that for the usual numbers of treatments 2½ times
as many samples are required for the method of paired comparisons as for the analysis of 
variance. It is conceivable in a tasting experiment, for example, that an individual can 
indicate rankings in a paired-treatment experiment as contrasted to scoring individual 
samples on an 11-point scale, say, with much less taste fatigue and with a speed ratio in 
favour of ranking in excess of the ratio of sample numbers. 


7.3. Summary

In this paper we have examined some of the properties of the method of paired comparisons. The results obtained are asymptotically correct for large-sample sizes or for large numbers of repetitions of the paired-comparisons design. Formulae for the variances and covariances of estimates of treatment ratings, $\pi_1, \ldots, \pi_t$, have been obtained, and these were not heretofore available. It has been shown that statistics previously used for tests of significance have limiting non-central $\chi^2$-distributions, and the appropriate parameters of non-centrality are given.

It was found that in comparison with the analysis of variance the relative efficiency of the method of paired comparisons is $t/\{\pi(t-1)\}$, and, when $t = 2$, this has the value $2/\pi$ previously obtained in comparing the sign test with the 'Student' $t$-test. The method of paired comparisons was seen to be considerably better than a multi-binomial procedure postulated and the asymptotic powers of these two tests are plotted for a number of examples along with similar values for the analysis of variance.

Examples of the use of the power function developed were given in application to an 
experiment in taste testing. Estimated variances and covariances of the estimators of the 
example were computed. 

REFERENCES 

ABELSON, R. M. & BRADLEY, R. A. (1954). A 2 × 2 factorial with paired comparisons. Biometrics, 10, 487.

BRADLEY, R. A. (1953). Some statistical methods in taste testing and quality evaluation. Biometrics, 9, 22.

BRADLEY, R. A. (1954a). Incomplete block rank analysis: on the appropriateness of the model for a method of paired comparisons. Biometrics, 10, 375.

BRADLEY, R. A. (1954b). Rank analysis of incomplete block designs. II. Additional tables for the method of paired comparisons. Biometrika, 41, 502.

BRADLEY, R. A. & TERRY, M. E. (1952). Rank analysis of incomplete block designs. I. The method of paired comparisons. Biometrika, 39, 324.

CHANDA, K. C. (1954). A note on the consistency and maxima of the roots of likelihood equations. Biometrika, 41, 56.

COCHRAN, W. G. (1937). The efficiencies of the binomial series test of significance of a mean and of a correlation coefficient. J. R. Statist. Soc. A, 100, 69.

CRAMÉR, H. (1946). Mathematical Methods of Statistics. Princeton: Princeton University Press.

DIXON, W. J. (1953). Power functions of the sign test and power efficiency for normal alternatives. Ann. Math. Statist. 24, 467.

DIXON, W. J. & MOOD, A. M. (1946). The statistical sign test. J. Amer. Statist. Ass. 41, 557.

FIX, EVELYN (1949). Tables of noncentral χ². Univ. Calif. Publ. Statist. 1 (2), 15.











470 Rank analysis of incomplete block designs. IL] 


Haxp, A. (1952). Siatistical Theory with Engineering Applications, §9-9. New York: John Wiley and 
Sons, Inc. 

HEMELRIJEK, J. (1952). A theorem on the sign test when ties are present. Indag. Math. 14, 322. 

Hopxins, J. W. (1954). Incomplete block rank analysis: some taste test results. Biometrics, 10, 391. 

MosTE.LeR, F. (1951). Remarks on the method of paired comparisons: I. The least squares solution 
assuming equal standard deviations and equal correlations. Psychometrika, 16, 3. 

NoETHER, G. (1955). On a theorem of Pitman. Ann. Math. Statist. 26, 64. 

Pearson, E.8. & Hartiey, H. O. (1951). Charts for the power function for analysis of variance tests, 
revised from the non-central F-distribution. Biometrika, 38, 112. 

Pitman, E. J. G. (1948). Lectwre Notes on Non-Parametric Statistics. Columbia University, New York. 

Tana, P. C. (1938). The power function of analysis of variance tests with tables and illustrations of 
their use. Statist. Res. Mem. 2, 126. 

Terry, M. E., Bravery, R. A. & Davis, L. L. (1952). New designs and techniques for organoleptic 
testing. Food Tech., Lond., 6, 250. 

Wa tp, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of 
observations is large. Trans. Amer. Math. Soc. 54, 426. 

WatsH, J. (1946). On the power function of the sign test for slippage of means. Ann. Muth Statist. 
17, 358. 










A METHOD OF ASSIGNING CONFIDENCE LIMITS TO LINEAR 
COMBINATIONS OF VARIANCES 


By A. HUITSON, Ph.D.
Royal Naval Scientific Service 


1. INTRODUCTION AND SUMMARY 


Linear combinations of variances occur frequently in analysis of variance work, when we
are considering the estimation either of individual components of variance or of gross
variability. Percentage points for such functions cannot be computed directly in the same
manner as that adopted when the $\chi^2$ distribution is used to give percentage points in
problems where only a single unknown variance is involved.

It is the purpose of the present paper (a) to derive a series approximation to the percentage
points of functions of this type, (b) to investigate the numerical behaviour of the expansion
in some particular cases and (c) to provide tables of the series expansion, suitable for use
in certain practical problems. An example is given in which use is made of the tables,
together with some general discussion of the single-factor type of experiment.


2. DEVELOPMENT OF A SERIES APPROXIMATION 


Let us consider any number of unknown population variances which we shall denote by
$\sigma_1^2, \sigma_2^2, \ldots, \sigma_r^2$.*

Suppose that the observed data provide estimates $s_i^2$ of these variances based on $f_i$ degrees
of freedom respectively, and that these estimates are distributed independently of each
other in the form

$$p(s_i^2)\,ds_i^2 = \frac{1}{\Gamma(\tfrac12 f_i)} \left(\frac{f_i s_i^2}{2\sigma_i^2}\right)^{\frac12 f_i - 1} \exp\left[-\frac{f_i s_i^2}{2\sigma_i^2}\right] d\left(\frac{f_i s_i^2}{2\sigma_i^2}\right). \qquad (1)$$

We are seeking to assign confidence limits, calculable from the observations, to the
expression $\sum_{i=1}^{r} \lambda_i\sigma_i^2$, where $\lambda_1, \ldots, \lambda_r$ are known arbitrary constants. In order to accomplish
this, we shall seek a function $y$, which will satisfy the equation

$$P_r\left[\frac{\sum_{i=1}^{r}\lambda_i s_i^2}{\sum_{i=1}^{r}\lambda_i\sigma_i^2} < y(s_1^2, s_2^2, \ldots, s_r^2, P)\right] = P, \qquad (2)$$

where $P_r$ is used to denote 'the probability of the relation in the bracket following'. It is
by no means obvious that there is a function $y$ having the required property, but as we shall
see later it is certainly possible to find various functions which have approximately the
required properties. We shall be content to leave open the question as to whether our
problem may be solved exactly.


* More strictly, in most practical applications, the $\sigma_1^2, \sigma_2^2, \ldots, \sigma_r^2$ are parameters which are already
linear functions of more basic variances. It will be convenient, however, to refer to the $\sigma_1^2, \sigma_2^2, \ldots, \sigma_r^2$
themselves as variances since they have estimates $s_1^2, s_2^2, \ldots, s_r^2$ which are distributed in the form (1).




The large-sample normal approximation to $y$ is

$$1 + \xi\,\frac{\sqrt{2\sum \lambda_i^2\sigma_i^4/f_i}}{\sum \lambda_i\sigma_i^2},$$

where $\xi$ represents the $P$th percentage point of the unit normal deviate. However, as the
variances are unknown, it will be convenient to take as our initial approximation to $y$

$$1 + \xi\,\frac{\sqrt{2\sum \lambda_i^2 s_i^4/f_i}}{\sum \lambda_i s_i^2}. \qquad (3)$$

All summations are to be taken over the range 1 to $r$.
We shall now introduce the notation

$$V_{mn} = \sum \frac{c_i^m}{f_i^{\,n-1}}, \quad\text{where}\quad c_i = \frac{\lambda_i s_i^2}{\sum \lambda_i s_i^2},$$

and denote the corresponding functions of the unknown variances by $\Lambda_{mn}$ and $\gamma_i$ respectively.
Thus our initial approximation to $y$ becomes

$$1 + \xi\sqrt{2V_{22}}. \qquad (4)$$

If we assume that $y$ can be written in the form

$$y = h_0 + h_1 + h_2 + h_3 + \ldots, \qquad (5)$$

where $h_n$ is of order $f^{-n/2}$, then the large-sample normal approximation implies that

$$h_0 = 1, \qquad h_1 = \xi\sqrt{2V_{22}}. \qquad (6)$$

We shall now proceed to derive the next term, $h_2$, in the series expansion of $y$. Let us
consider the new variable

$$u = \frac{\sum \lambda_i s_i^2}{\sum \lambda_i\sigma_i^2} - (h_0 + h_1). \qquad (7)$$

As $u$ is a function of the variance estimates, the moment-generating function of $u$ may be
found by averaging over the distributions of the $s_i^2$. On simplification we find the first three
cumulants of $u$ to be

$$K_1 = -\xi\sqrt{2\Lambda_{22}} \quad\text{to order } f^{-1},$$
$$K_2 = 2\Lambda_{22} + 4\sqrt{2}\,\xi\{(\Lambda_{22})^{3/2} - \Lambda_{33}(\Lambda_{22})^{-1/2}\} \quad\text{to order } f^{-3/2},$$
$$\text{and}\quad K_3 = 8\Lambda_{33} \quad\text{to order } f^{-2}.$$

Thus we can see that $K_1 = O(f^{-1/2})$, $K_2 = O(f^{-1})$ and $K_3 = O(f^{-2})$. It appears that $u$ has
cumulants which decrease in powers of $f$ such that $K_n = O(f^{-n+1})$.


Using the Cornish & Fisher expansion (1937) we then obtain after simplification the $P$th
percentage point of $u$, $u_P$ say, to be equal to

$$\xi^2\left[2\Lambda_{22} - \frac{4\Lambda_{33}}{3\Lambda_{22}}\right] - \frac{2\Lambda_{33}}{3\Lambda_{22}} \quad\text{to order } \frac{1}{f}. \qquad (8)$$













But this is a function of the ratios of the unknown variances and thus is not of immediate
use to us. Consider any particular ratio, say $\sigma_m^2/\sigma_n^2$, where $\sigma_m^2$ and $\sigma_n^2$ are any two of the
$\sigma_1^2, \ldots, \sigma_r^2$. The mean of $s_m^2/s_n^2$ is equal to

$$\frac{\sigma_m^2}{\sigma_n^2}\left(\frac{f_n}{f_n - 2}\right) = \frac{\sigma_m^2}{\sigma_n^2}\left(1 + \frac{2}{f_n} + \ldots\right).$$

Also the variance of $s_m^2/s_n^2$ is of the order of $f^{-1}$. Therefore neglecting terms of order $f^{-1/2}$
and lower we have in the probability sense $s_m^2/s_n^2 = \sigma_m^2/\sigma_n^2$. Thus, without affecting the
accuracy of equation (8) we can replace all the ratios of the unknown variances by the
ratios of their estimates, since all terms on the right-hand side of (8) are of the same order.

Therefore $\Lambda_{mn}$ will now become $V_{mn}$ and

$$h_2 = \xi^2\left[2V_{22} - \frac{4V_{33}}{3V_{22}}\right] - \frac{2V_{33}}{3V_{22}}. \qquad (9)$$

The fact that (9) is of order $f^{-1}$ and not of higher order is a verification that we have taken
the correct value for $h_1$. Thus the method may be regarded as self-checking, in that, in each
application, it checks the value which has been obtained previously and adds another term
to the series. Using the same method the next term of the series expansion for $y$, $h_3$, has
been found to be


hg = V(2Voq) {El — $¥e0(Vor)—* + $V ag (Vor)? — Voo( Vor)? — 92 (Von)? (Var) ) 
+ E3[ — $V (Vor)! + 2Voq — 32 V5 (Vor) > + $Vag(Vex) J}. (10) 


However, the solution was carried no further as the work involved would have been too 
laborious. 
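The first three terms of the series are simple to evaluate numerically. The following is a minimal sketch, not the author's own computation: it assumes $V_{mn} = \sum c_i^m / f_i^{n-1}$ with $c_i = \lambda_i s_i^2/\sum\lambda_i s_i^2$, takes $h_0 = 1$, $h_1 = \xi\sqrt{2V_{22}}$ and $h_2 = \xi^2[2V_{22} - 4V_{33}/(3V_{22})] - 2V_{33}/(3V_{22})$, and omits $h_3$. The function names `v` and `series_y` are hypothetical.

```python
import math
from statistics import NormalDist


def v(m, n, c, f):
    """V_mn = sum_i c_i**m / f_i**(n - 1) (assumed form of the notation)."""
    return sum(ci ** m / fi ** (n - 1) for ci, fi in zip(c, f))


def series_y(lams, s2, f, p):
    """Return h0 + h1 + h2, the series for the P-th percentage point y.

    lams: known constants lambda_i; s2: variance estimates s_i^2;
    f: degrees of freedom f_i; p: probability level P.
    h3 and later terms are deliberately left out of this sketch.
    """
    xi = NormalDist().inv_cdf(p)  # P-th point of the unit normal deviate
    total = sum(l * s for l, s in zip(lams, s2))
    c = [l * s / total for l, s in zip(lams, s2)]
    v22, v33 = v(2, 2, c, f), v(3, 3, c, f)
    h0 = 1.0
    h1 = xi * math.sqrt(2.0 * v22)
    h2 = xi ** 2 * (2.0 * v22 - 4.0 * v33 / (3.0 * v22)) - 2.0 * v33 / (3.0 * v22)
    return h0 + h1 + h2


if __name__ == "__main__":
    # Mean squares and degrees of freedom from the cotton-yarn example of Section 6.
    lower = series_y([0.5, 0.5], [167.88, 47.34], [114, 120], 0.05)
    upper = series_y([0.5, 0.5], [167.88, 47.34], [114, 120], 0.95)
    print(f"lower 5% multiplier: {lower:.3f}, upper 5% multiplier: {upper:.3f}")
```

With the mean squares of the example in Section 6 the two multipliers come out at about 0.83 and 1.18, in line with the values read from Tables 1 and 2 there. For a single variance the expression reduces to the leading terms of the Cornish & Fisher expansion of $\chi^2_f/f$.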


3. CHECKS 


The following algebraic checks have been carried out on the series expansion of y derived 
above. 

(a) It has been shown to agree with an independent development of the series solution
carried through by Welch (1956) for the same problem, although his treatment and the
algebra involved are somewhat different in detail from mine.

(b) It has also been shown to agree with a result published by Bartlett (1953). In this
paper Bartlett derives a confidence interval for a function of the form $\sigma_1^2 + \lambda\sigma_2^2$. To the
appropriate order of terms, $h_0 + h_1 + h_2$ is the solution of this equation.

(c) A simple check, which has also been carried out, is to consider the special case when all
the degrees of freedom except one are infinite. A series expansion for the function $\sum\lambda_i\sigma_i^2$ can
then be found from the $\chi^2$ distribution and this compared term by term with the series
derived above.
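Check (c) can be written out to order $1/f$ along the following lines. This is a sketch under stated assumptions rather than the author's own working: it takes $\lambda_1 > 0$, uses $h_2 = \xi^2[2V_{22} - 4V_{33}/(3V_{22})] - 2V_{33}/(3V_{22})$ with $V_{mn} = \sum c_i^m/f_i^{n-1}$, and introduces the symbols $K$, $q$ and $\epsilon$ purely for illustration.

```latex
% Only f_1 is finite, so s_i^2 = \sigma_i^2 for i \ge 2; put K = \sum_{i \ge 2} \lambda_i \sigma_i^2.
% Let q be the P-th percentage point of \chi^2_{f_1}/f_1, so that \Pr[s_1^2/\sigma_1^2 < q] = P.
\frac{s_1^2}{\sigma_1^2} < q
\iff
\frac{\lambda_1 s_1^2 + K}{\lambda_1 \sigma_1^2 + K} < \frac{1}{1 - c_1 + c_1/q},
\qquad c_1 = \frac{\lambda_1 s_1^2}{\lambda_1 s_1^2 + K}.

% Cornish & Fisher expansion of the chi-square percentage point:
q = 1 + \epsilon, \qquad
\epsilon = \xi \sqrt{2/f_1} + \tfrac{2}{3}(\xi^2 - 1)/f_1 + O(f_1^{-3/2}).

% Expanding the exact multiplier in powers of \epsilon:
y = \frac{1}{1 - c_1 \epsilon/(1 + \epsilon)}
  = 1 + c_1 \epsilon - c_1 \epsilon^2 + c_1^2 \epsilon^2 + O(\epsilon^3)
  = 1 + c_1 \xi \sqrt{2/f_1}
    + \xi^2 \left[ \frac{2 c_1^2}{f_1} - \frac{4 c_1}{3 f_1} \right]
    - \frac{2 c_1}{3 f_1} + O(f_1^{-3/2}).

% With V_{22} = c_1^2/f_1 and V_{33} = c_1^3/f_1^2 (so V_{33}/V_{22} = c_1/f_1)
% this agrees term by term with h_0 + h_1 + h_2.
```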


4. NUMERICAL TESTING 


The series approximation given by $h_0 + h_1 + h_2 + h_3$ is applicable to the general problem
involving $r$ variances. In order to carry out a numerical investigation of it, we shall limit
ourselves to the particular case $r = 2$ only and we shall assume both $\lambda$'s have the same sign.
We can now write

$$V_{mn} = \frac{c^m}{f_1^{\,n-1}} + \frac{(1-c)^m}{f_2^{\,n-1}},$$

$$\text{where}\quad c = \frac{\lambda_1 s_1^2}{\lambda_1 s_1^2 + \lambda_2 s_2^2}
\quad\text{and}\quad \gamma = \frac{\lambda_1\sigma_1^2}{\lambda_1\sigma_1^2 + \lambda_2\sigma_2^2}.$$










For the cases in which both the $\lambda$'s are of the same sign, the permissible values of $c$ lie
between 0 and 1.

Consider an approximation to the $P$th percentage point of $(\lambda_1 s_1^2 + \lambda_2 s_2^2)/(\lambda_1\sigma_1^2 + \lambda_2\sigma_2^2)$
depending only on the sample statistic $c$ (it will also depend on the probability $P$), $g(c)$ say.
To test the accuracy of this approximation, we must evaluate

$$P_r\left[\frac{\lambda_1 s_1^2 + \lambda_2 s_2^2}{\lambda_1\sigma_1^2 + \lambda_2\sigma_2^2} < g(c)\right] = Q, \text{ say}, \qquad (11)$$

in order to see how close it is to $P$. Strictly, such a function as $g(c)$ is not an approximation
to the $P$th percentage point of $(\lambda_1 s_1^2 + \lambda_2 s_2^2)/(\lambda_1\sigma_1^2 + \lambda_2\sigma_2^2)$, as this is a function of the
unknown variances while $g(c)$ is independent of these variances. It can only be regarded as
such an approximation in the sense that it makes the value of $Q$ in equation (11) close to $P$.
It can be shown by straightforward transformations that 








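$$Q = \int_0^1 P_r\left[\chi^2_{f_1+f_2} < \frac{g(c)}{\gamma b/f_1 + (1-\gamma)(1-b)/f_2}\right] p(b)\,db, \qquad (12)$$

$$\text{where}\quad p(b)\,db = \frac{\Gamma(\tfrac12 f_1 + \tfrac12 f_2)}{\Gamma(\tfrac12 f_1)\,\Gamma(\tfrac12 f_2)}\, b^{\frac12 f_1 - 1}(1-b)^{\frac12 f_2 - 1}\,db$$

$$\text{and}\quad c = \frac{f_2\gamma b}{f_2\gamma b + f_1(1-\gamma)(1-b)}.$$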


From equation (12) the following values of $Q$ were obtained by quadrature when
$f_1 = f_2 = 10$ with $g(c)$ replaced by the series approximation, cut off at various points, in order
to see what effect each successive term of the series had on the result:


Values of $Q$ when $f_1 = f_2 = 10$

                                        P = 0.05                P = 0.95
                                  γ = 0.5    γ = 0.2      γ = 0.5    γ = 0.2
  g(c) = h_0 + h_1                 0.0197     0.0269       0.9424     0.9419
  g(c) = h_0 + h_1 + h_2           0.0358     0.0469       0.9547     0.9533
  g(c) = h_0 + h_1 + h_2 + h_3     0.0453     0.0544       0.9497     0.9501

* ects en ee hed " 








The values of $Q$ for $\gamma = \alpha$, say, and $\gamma = 1 - \alpha$ will be identical when $f_1 = f_2$. Therefore the
values of $Q$ when $\gamma = 0.8$ will correspond to those given in the third and last columns of
the above table. In all except the third column it is seen that the addition of each successive
term to the series approximation brings the value of $Q$ closer to the chosen value of $P$. In
the third column the addition of $h_3$ has moved $Q$ in the right direction, but too far.

It may also be noted that the values of Q obtained when P = 0-05 are much more in error 
than the corresponding ones for P = 0-95. This suggests that the series approximation may 
not be so good at the lower end of the distribution as it is at the upper end. 
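A quadrature of this kind can be sketched with the standard library alone. The set-up below is one way to obtain $Q$ and is an assumption of this sketch, not necessarily the transformation used in the paper: with $x_j = f_j s_j^2/\sigma_j^2$, the quantity $b = x_1/(x_1 + x_2)$ follows a Beta($\tfrac12 f_1$, $\tfrac12 f_2$) distribution independently of $x_1 + x_2$, which is $\chi^2$ with $f_1 + f_2$ degrees of freedom. The code is restricted to even degrees of freedom so that the $\chi^2$ distribution function has a closed form; `g_series` stops at $h_2$, and all function names are hypothetical.

```python
import math
from statistics import NormalDist


def chi2_cdf_even(x, dof):
    """Chi-square distribution function for an even number of degrees of freedom."""
    if x <= 0.0:
        return 0.0
    half, term, total = x / 2.0, 1.0, 1.0
    for j in range(1, dof // 2):
        term *= half / j
        total += term
    return 1.0 - math.exp(-half) * total


def g_series(c, f1, f2, xi):
    """h0 + h1 + h2 for r = 2 (assumed forms; h3 is omitted in this sketch)."""
    v22 = c ** 2 / f1 + (1.0 - c) ** 2 / f2
    v33 = c ** 3 / f1 ** 2 + (1.0 - c) ** 3 / f2 ** 2
    h1 = xi * math.sqrt(2.0 * v22)
    h2 = xi ** 2 * (2.0 * v22 - 4.0 * v33 / (3.0 * v22)) - 2.0 * v33 / (3.0 * v22)
    return 1.0 + h1 + h2


def coverage_q(g, gamma, f1, f2, steps=2000):
    """Simpson's rule for Q; f1 and f2 must be even and at least 2, steps even."""
    a1, a2 = f1 / 2.0, f2 / 2.0
    norm = math.gamma(a1 + a2) / (math.gamma(a1) * math.gamma(a2))

    def integrand(b):
        # The ratio (l1*s1^2 + l2*s2^2)/(l1*sigma1^2 + l2*sigma2^2) equals chi2_total * w.
        w = gamma * b / f1 + (1.0 - gamma) * (1.0 - b) / f2
        c = (gamma * b / f1) / w  # the sample statistic c as a function of b
        density = norm * b ** (a1 - 1.0) * (1.0 - b) ** (a2 - 1.0)
        return chi2_cdf_even(g(c) / w, f1 + f2) * density

    h = 1.0 / steps
    total = integrand(0.0) + integrand(1.0)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(k * h)
    return total * h / 3.0


if __name__ == "__main__":
    for p in (0.05, 0.95):
        xi = NormalDist().inv_cdf(p)
        for gamma in (0.5, 0.2):
            q = coverage_q(lambda c: g_series(c, 10, 10, xi), gamma, 10, 10)
            print(f"P = {p}, gamma = {gamma}: Q = {q:.4f}")
```

The printed values are intended for comparison with the $h_0 + h_1 + h_2$ row of the table of $Q$. When $f_1 = f_2$ and $\gamma = 0.5$ the weight `w` is constant, so a constant `g` reproduces a plain $\chi^2$ probability, which gives a convenient check on the integration.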































This comparison relates only to the case r = 2, and A, and A, positive. More work would 
be necessary to assess the merits of the series for the more general situation. The results for 
the case considered are, however, sufficiently promising to make it appear worth while to 
table some results for the larger numbers of degrees of freedom at least. Such tables are 
presented in the next section. 


5. TABLES 


A two-decimal table will usually be quite adequate for a working statistician. Four
probability levels are tabled in this paper (Tables 1-4).

We shall regard the series solution as giving us two-decimal accuracy when $h_3 < 0.005$.
The argument here is that if $h_3$ is negligible, then so, almost certainly, are higher terms of
the series. If $h_3$ is not negligible, it does not follow that the higher terms are not negligible,
but we have no absolute justification for assuming that they are. Using this criterion, the
tabled values will be correct, except possibly those for which one of the degrees of freedom
is 16, which may be one or two units in error in the second decimal place.

The values of degrees of freedom for which the tables have been constructed are such that
their square roots form a harmonic set. The purpose of this is to make interpolation with
respect to the degrees of freedom easier. Thus, for intermediate values of $f$, it is necessary
to interpolate with respect to $12/\sqrt{f}$. Tables 1-4 can be used to assign 90 and 98% confidence
intervals to any sum of two variances, provided the degrees of freedom of the variance
estimates are large enough.
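Interpolation with respect to $12/\sqrt{f}$ can be illustrated as follows. This is a minimal sketch assuming simple linear interpolation in one of the two degrees of freedom at a time; the function names are hypothetical, and the tabulated degrees of freedom 16, 36, 144 and $\infty$ map to the equally spaced arguments 3, 2, 1 and 0.

```python
import math


def harmonic_argument(f):
    """Interpolation argument 12/sqrt(f); infinite f maps to 0."""
    return 12.0 / math.sqrt(f)


def interpolate_in_f(f, f_a, value_a, f_b, value_b):
    """Interpolate linearly in 12/sqrt(f) between two tabulated degrees of freedom."""
    x, xa, xb = (harmonic_argument(t) for t in (f, f_a, f_b))
    weight = (x - xa) / (xb - xa)
    return value_a + weight * (value_b - value_a)


if __name__ == "__main__":
    # Table 2, f1 = 144, c = 0.8: 1.16 at f2 = 144 and 1.18 at f2 = 36.
    print(interpolate_in_f(120, 144, 1.16, 36, 1.18))
```

For a case in which both degrees of freedom fall between tabulated values, the same step would be applied first in one direction and then in the other.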


6. EXAMPLE 


The following example is given by Tippett (1952, p. 111). It is of the single-factor analysis
type. Mule cops (bobbins) of cotton yarn were collected in blocks of 20, each block being
from a different mule. Two leas (1 lea = 120 yards) were weighed from each cop, giving for
each block 39 total degrees of freedom, 19 'between cops' and 20 'within cops', together with
corresponding sums of squares. There were six such blocks and the six sets of sums of squares
and degrees of freedom were added to give the following table:











  Source of variation             Sum of       Degrees of    Mean square          Expected value of
                                  squares      freedom                            mean square

  Between cops (within blocks)    19,138.85    114           167.88 = $s_1^2$     $\sigma_1^2 = 2\sigma_B^2 + \sigma_I^2$
  Within cops                      5,681.00    120            47.34 = $s_2^2$     $\sigma_2^2 = \sigma_I^2$
  Total                           24,819.85    234

where $\sigma_B^2$ is the variance between cops and $\sigma_I^2$ is the interaction (or error) variance in the
usual manner.

The mean squares give an analysis of the variations within the blocks, and these are 
interesting because they are due to factors not easily controllable. The block to block 
variations are not studied because they can be easily eliminated by careful adjustment of 
the mules. 






























































Table 1. Lower 5% critical values of $(\lambda_1 s_1^2 + \lambda_2 s_2^2)/(\lambda_1\sigma_1^2 + \lambda_2\sigma_2^2)$

                  Value of $\lambda_1 s_1^2/(\lambda_1 s_1^2 + \lambda_2 s_2^2)$
  f1    f2     0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0

  16    16    0.50  0.52  0.55  0.59  0.62  0.64  0.62  0.59  0.55  0.52  0.50
        36    0.65  0.67  0.69  0.71  0.70  0.66  0.62  0.58  0.55  0.52  0.50
       144    0.81  0.82  0.81  0.76  0.71  0.66  0.62  0.58  0.55  0.52  0.50
         ∞    1.00  0.91  0.83  0.77  0.71  0.66  0.62  0.58  0.55  0.52  0.50

  36    16    0.50  0.52  0.55  0.58  0.62  0.66  0.70  0.71  0.69  0.67  0.65
        36    0.65  0.67  0.69  0.72  0.74  0.75  0.74  0.72  0.69  0.67  0.65
       144    0.81  0.83  0.83  0.83  0.81  0.78  0.75  0.72  0.69  0.67  0.65
         ∞    1.00  0.95  0.90  0.86  0.82  0.78  0.75  0.72  0.70  0.67  0.65

 144    16    0.50  0.52  0.55  0.58  0.62  0.66  0.71  0.76  0.81  0.82  0.81
        36    0.65  0.67  0.69  0.72  0.75  0.78  0.81  0.83  0.83  0.83  0.81
       144    0.81  0.83  0.84  0.86  0.86  0.87  0.86  0.86  0.84  0.83  0.81
         ∞    1.00  0.98  0.96  0.94  0.92  0.90  0.88  0.86  0.85  0.83  0.81

   ∞    16    0.50  0.52  0.55  0.58  0.62  0.66  0.71  0.77  0.83  0.91  1.00
        36    0.65  0.67  0.70  0.72  0.75  0.78  0.82  0.86  0.90  0.95  1.00
       144    0.81  0.83  0.85  0.86  0.88  0.90  0.92  0.94  0.96  0.98  1.00
         ∞    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00


Table 2. Upper 5% critical values of $(\lambda_1 s_1^2 + \lambda_2 s_2^2)/(\lambda_1\sigma_1^2 + \lambda_2\sigma_2^2)$

                  Value of $\lambda_1 s_1^2/(\lambda_1 s_1^2 + \lambda_2 s_2^2)$
  f1    f2     0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0

  16    16    1.64  1.56  1.50  1.46  1.44  1.43  1.44  1.46  1.50  1.56  1.64
        36    1.42  1.37  1.35  1.34  1.34  1.36  1.39  1.43  1.48  1.55  1.64
       144    1.20  1.19  1.20  1.22  1.25  1.29  1.33  1.39  1.46  1.55  1.64
         ∞    1.00  1.04  1.09  1.13  1.18  1.24  1.30  1.37  1.45  1.54  1.64

  36    16    1.64  1.55  1.48  1.43  1.39  1.36  1.34  1.34  1.35  1.37  1.42
        36    1.42  1.37  1.33  1.31  1.29  1.29  1.29  1.31  1.33  1.37  1.42
       144    1.20  1.18  1.18  1.18  1.20  1.21  1.24  1.27  1.31  1.36  1.42
         ∞    1.00  1.03  1.06  1.10  1.13  1.17  1.21  1.26  1.31  1.36  1.42

 144    16    1.64  1.55  1.46  1.39  1.33  1.29  1.25  1.22  1.20  1.19  1.20
        36    1.42  1.36  1.31  1.27  1.24  1.21  1.20  1.18  1.18  1.18  1.20
       144    1.20  1.18  1.16  1.15  1.14  1.14  1.14  1.15  1.16  1.18  1.20
         ∞    1.00  1.02  1.03  1.05  1.07  1.09  1.11  1.13  1.15  1.18  1.20

   ∞    16    1.64  1.54  1.45  1.37  1.30  1.24  1.18  1.13  1.09  1.04  1.00
        36    1.42  1.36  1.31  1.26  1.21  1.17  1.13  1.10  1.06  1.03  1.00
       144    1.20  1.18  1.15  1.13  1.11  1.09  1.07  1.05  1.03  1.02  1.00
         ∞    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00

















Table 3. Lower 1% critical values of $(\lambda_1 s_1^2 + \lambda_2 s_2^2)/(\lambda_1\sigma_1^2 + \lambda_2\sigma_2^2)$

                  Value of $\lambda_1 s_1^2/(\lambda_1 s_1^2 + \lambda_2 s_2^2)$
  f1    f2     0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0

  16    16    0.37  0.39  0.41  0.45  0.50  0.53  0.50  0.45  0.41  0.39  0.37
        36    0.53  0.56  0.59  0.61  0.59  0.54  0.48  0.44  0.41  0.39  0.37
       144    0.75  0.76  0.74  0.66  0.58  0.52  0.47  0.44  0.41  0.39  0.37
         ∞    1.00  0.86  0.74  0.65  0.58  0.52  0.48  0.44  0.42  0.39  0.37

  36    16    0.37  0.39  0.41  0.44  0.48  0.54  0.59  0.61  0.59  0.56  0.53
        36    0.53  0.56  0.59  0.62  0.65  0.66  0.65  0.62  0.59  0.56  0.53
       144    0.75  0.76  0.77  0.76  0.73  0.69  0.65  0.62  0.59  0.56  0.53
         ∞    1.00  0.92  0.85  0.79  0.74  0.69  0.65  0.62  0.59  0.56  0.53

 144    16    0.37  0.39  0.41  0.44  0.47  0.52  0.58  0.66  0.74  0.76  0.75
        36    0.53  0.56  0.59  0.62  0.65  0.69  0.73  0.76  0.77  0.76  0.75
       144    0.75  0.77  0.78  0.80  0.81  0.82  0.81  0.80  0.78  0.77  0.75
         ∞    1.00  0.97  0.94  0.91  0.88  0.85  0.83  0.81  0.79  0.77  0.75

   ∞    16    0.37  0.39  0.42  0.44  0.48  0.52  0.58  0.65  0.74  0.86  1.00
        36    0.53  0.56  0.59  0.62  0.65  0.69  0.74  0.79  0.85  0.92  1.00
       144    0.75  0.77  0.79  0.81  0.83  0.85  0.88  0.91  0.94  0.97  1.00
         ∞    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00


Table 4. Upper 1% critical values of $(\lambda_1 s_1^2 + \lambda_2 s_2^2)/(\lambda_1\sigma_1^2 + \lambda_2\sigma_2^2)$

                  Value of $\lambda_1 s_1^2/(\lambda_1 s_1^2 + \lambda_2 s_2^2)$
  f1    f2     0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0

  16    16    2.00  1.85  1.76  1.70  1.67  1.65  1.67  1.70  1.76  1.85  2.00
        36    1.63  1.55  1.52  1.50  1.51  1.54  1.58  1.63  1.71  1.83  2.00
       144    1.29  1.28  1.29  1.32  1.36  1.41  1.47  1.56  1.68  1.82  2.00
         ∞    1.00  1.06  1.11  1.18  1.25  1.33  1.42  1.53  1.66  1.82  2.00

  36    16    2.00  1.83  1.71  1.63  1.58  1.54  1.51  1.50  1.52  1.55  1.63
        36    1.63  1.54  1.49  1.46  1.43  1.42  1.43  1.45  1.49  1.54  1.63
       144    1.29  1.27  1.26  1.27  1.28  1.31  1.35  1.40  1.46  1.54  1.63
         ∞    1.00  1.04  1.08  1.13  1.18  1.24  1.30  1.37  1.45  1.53  1.63

 144    16    2.00  1.82  1.68  1.56  1.47  1.41  1.36  1.32  1.29  1.28  1.29
        36    1.63  1.54  1.46  1.40  1.35  1.31  1.28  1.27  1.26  1.27  1.29
       144    1.29  1.26  1.24  1.22  1.21  1.20  1.21  1.22  1.24  1.26  1.29
         ∞    1.00  1.02  1.05  1.07  1.10  1.13  1.16  1.19  1.22  1.26  1.29

   ∞    16    2.00  1.82  1.66  1.53  1.42  1.33  1.25  1.18  1.11  1.06  1.00
        36    1.63  1.53  1.45  1.37  1.30  1.24  1.18  1.13  1.08  1.04  1.00
       144    1.29  1.26  1.22  1.19  1.16  1.13  1.10  1.07  1.05  1.02  1.00
         ∞    1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00






















‘Between cops’ corresponds to the factor and ‘within cops’ to the error; but it is in- 
appropriate to speak of error when the investigation is not a controlled, or partially con- 
trolled, experiment. 

Suppose that we wish to find confidence limits for the gross variability (i.e. the variance
of a single observation). The estimate of the gross variability, $\sigma_B^2 + \sigma_I^2 = \tfrac12\sigma_1^2 + \tfrac12\sigma_2^2$, is

$$\lambda_1 s_1^2 + \lambda_2 s_2^2 = \tfrac12(167.88) + \tfrac12(47.34) = 107.61.$$

Interpolating in Table 1 we get a value of 0.83, and from Table 2 we get 1.18. Hence

$$P_r\left\{0.83 < \frac{\lambda_1 s_1^2 + \lambda_2 s_2^2}{\lambda_1\sigma_1^2 + \lambda_2\sigma_2^2} < 1.18\right\} = 0.90,$$

$$\text{or}\quad P_r\left\{\frac{107.61}{1.18} < \sigma_B^2 + \sigma_I^2 < \frac{107.61}{0.83}\right\} = 0.90,$$

$$\text{or}\quad P_r\{91 < \sigma_B^2 + \sigma_I^2 < 130\} = 0.90.$$

Thus 90% confidence limits for $\sqrt{(\sigma_B^2 + \sigma_I^2)}$ are approximately 9.5 and 11.4.
In a similar manner 98% confidence limits can be found by using Tables 3 and 4.
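The arithmetic of this example can be retraced in a few lines. The sketch below takes the mean squares and the two table multipliers exactly as quoted in the text; the variable names are hypothetical.

```python
import math

# Mean squares from the analysis of variance table of the example.
mean_squares = {"between_cops": 167.88, "within_cops": 47.34}

# Gross variability: half of each mean square.
estimate = 0.5 * mean_squares["between_cops"] + 0.5 * mean_squares["within_cops"]

# Multipliers as read from Tables 1 and 2 in the text (lower and upper 5% points).
lower_ratio, upper_ratio = 0.83, 1.18

# Dividing by the upper point gives the lower limit, and vice versa.
variance_limits = (estimate / upper_ratio, estimate / lower_ratio)
sd_limits = tuple(math.sqrt(v) for v in variance_limits)

print(f"estimate of gross variability: {estimate:.2f}")
print(f"90% limits for the variance: {variance_limits[0]:.0f} to {variance_limits[1]:.0f}")
print(f"90% limits for the standard deviation: {sd_limits[0]:.1f} to {sd_limits[1]:.1f}")
```

Replacing the two multipliers by the corresponding entries interpolated from Tables 3 and 4 would give the 98% limits in the same way.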


7. DISCUSSION


In order to use the tables given in this paper in experiments of the type described above, it is 
necessary that both rows of the analysis of variance table should have degrees of freedom as 
large as 16. This can be ensured by making the number of levels of the factor large enough. 
If the number of levels is small we shall not get a good estimate of the variability between 
levels, and hence we shall be led to wide confidence limits for the gross variability. When the 
number of levels is very small, they will be so wide that it will not be worth spending time 
estimating them. Thus it is desirable that the number of levels should be as large as possible. 

The ideal experiment for this type of analysis will have two replications for each level 
and a large number of levels, so that both lines of the analysis of variance table are of com- 
parable accuracy. However, if the cost of increasing the number of replications is negligible, 
there is no harm done if more than two are made. It will be advantageous to make the number 
of replications reasonably large, if an accurate estimate of the interaction variance is also 
required from the analysis. 

Although the number of levels, K, should be large, experiments are often made in which 
K is small. For example, in chemical and engineering tests, K has an average of about 10, 
rarely rising above 20. There are two reasons for this use of a small number of levels of the 
factor: 

(i) The cost of increasing K is usually greater than the cost of increasing the number of 
replications. An example of this is found in agricultural field trials where the levels of a 
factor correspond to farms. Here the cost of transporting personnel and equipment to an 
extra farm will be greater than the cost of taking an extra observation on the farms which 
have already been chosen for the experiment. 

(ii) There may be more trouble involved in increasing K. For example, a manufacturer 
may send samples of his product to each of K laboratories, n tests being made on the sample 
in each laboratory. It will be much easier to carry out a few extra tests than to get a larger 
number of laboratories to co-operate. 



















However, there is a way in which large values of the degrees of freedom can arise. The 
experiment can be repeated several times and the results of the separate experiments 
combined. This, in fact, was done in the example given above. I think that this technique 
could profitably be applied to other experiments in which high accuracy is required. 


8. CONCLUSION 


A series expansion suitable for estimating confidence limits for a general linear
combination of variances has been derived as far as the term of order $f^{-3/2}$. In numerical work
for the case of a positive combination of two variances, it has shown itself to be remarkably 
good, the best results being obtained at the upper end of the distribution. In view of this 
fact, tables of the 1, 5, 95 and 99 % points have been calculated and are given in this paper. 
They are suitable for estimating confidence intervals for the sum of two variances, if the 
degrees of freedom are large enough. An example of their use is given in the text. 


REFERENCES 


Bartlett, M. S. (1953). Approximate confidence intervals. II. More than one unknown parameter.
Biometrika, 40, 306-17.
Cornish, E. A. & Fisher, R. A. (1937). Moments and cumulants in the specification of distributions.
Extrait de la Revue de l'Institut International de Statistique, 4, 1-14. Reprinted as paper 30 in
Contributions to Mathematical Statistics, by R. A. Fisher (1950). John Wiley and Sons, Inc.
Tippett, L. H. C. (1952). Technological Applications of Statistics. Williams and Norgate Ltd.
Welch, B. L. (1956). On linear combinations of several variances. J. Amer. Statist. Ass. 51, Part 1.













INTERPOLATIONS AND APPROXIMATIONS RELATED 
TO THE NORMAL RANGE 


By JOHN W. TUKEY* 
Princeton University 


Many quantities related to the normal range are only well known for a rather open grid of 
values, and interpolation is often considered difficult (vide David, Hartley & Pearson’s 
(1954) remark at p. 486). Methods of convenient interpolation should thus be of use. 

Elfving (1947) established a general asymptotic result for ranges of samples from
symmetrical distributions. For the normal distribution this implies that

$$n\int_{w/2}^{\infty} e^{-\frac12 x^2}\,dx$$

has a limiting distribution. The usual limiting value of Mill's ratio yields

$$\int_u^{\infty} e^{-\frac12 x^2}\,dx \sim u^{-1} e^{-\frac12 u^2}.$$

When we combine these and take logarithms, we learn that

$$w^2 + 8\ln w - 8\ln n$$

has a limiting distribution. We might then expect, when the middle term could be neglected,
that $w^2$ would behave like $8\ln n = 18.42\log_{10} n$, while the reciprocal of the variance of $w$
would be proportional to

$$\left(\frac{d(w^2 + 8\ln w)}{dw}\right)^2 = 4\left(w + \frac{4}{w}\right)^2 \sim 18.42\log_{10} n + 8.$$

Thus the use of $f_{a,b,c}(n) = a\log_{10} b(n + c)$
would seem to offer promise as a basis for interpolation.
With suitable choices of $b$ and $c$, the ratios of suitable powers of:
(1) $w_{5\%}$ and $w_{1\%}$, the upper 5 and 1% points of the normal range,
(2) $d_n$ and $V_n$, the average value and variance of the normal range,
(3) the ratio $d_n/\sqrt{V_n}$, and
(4) $(w/s)_{5\%}$ and $(w/s)_{1\%}$, the upper 5 and 1% points of the ratio of a range to $s$ from the
same sample,
to $\log_{10} b(n + c)$ are remarkably constant. This is shown in Table 1, where the values of $n > 20$
are those most likely to be key values in view of the existence of standard deviations for
the normal range for these values. The ratios have been carried to enough decimal places
to show either trends in the ratio, or irregularities which reflect limitations in the available
tables.
The constancy of the ratios is so great as to provide not only interpolations but simple and 
rather accurate approximations. With the growing use of internally programmed automatic 


* Prepared in connexion with research sponsored by the U.S. Office of Naval Research. 










computers, the already substantial convenience of such simple analytical approximations 
is being substantially enhanced. 

In view of the surprisingly good approximations obtained by choosing $b$ and $c$ properly,
it is important to emphasize that useful interpolation is also quite possible and even simpler,
since division by the square root of $\log_{10} n$ will make almost all of the quantities treated easily
interpolable for $n$ between 20 and 1000.

The 1% points of the range for $n < 20$ were taken from Table 22 of the new Biometrika
tables (based in part on Pearson (1942)), while the corresponding 5% points were
interpolated in Table 23 to give a third decimal. Values for $30 < n < 100$ were taken from Pearson
(1932), while values for $n$ = 200, 500 and 1000 were computed from the four moments found
by Tippett (1925), using Table 42 of the Biometrika tables (based in part on Pearson &
Merrington (1951)). The values so found were:

















  n             100     200     500     1000

  $w_{5\%}$     6.09    6.49    7.00    7.37
  $w_{1\%}$     6.62    7.02    7.50    7.84

(the values for $n = 100$ being included for comparison with Pearson's values of 6.08 and 6.63).
The approximations (corresponding to 'std' in Table 1)

$$w_{5\%} = \{17.0\log_{10}(1.5n)\}^{1/2},$$
$$w_{1\%} = \{17.5\log_{10}(3.3n)\}^{1/2},$$

seem to be about as accurate as the tabulated values for $20 < n < 1000$ and to be satisfactory
for $n > 6$. For some purposes, it is more convenient to write the approximations in the form
(corresponding to 'alt' in Table 1)

$$(w_n)_{5\%}/(w_2)_{5\%} = 1.49\{\log_{10}(1.5n)\}^{1/2},$$
$$(w_n)_{1\%}/(w_2)_{5\%} = 1.51\{\log_{10}(3.3n)\}^{1/2}.$$


The average values of the normal range, $d_n$, were taken from Table 27 of the Biometrika
tables (based on Tippett (1925)). The approximation ('std' in Table 1)

$$d_n = \{17.06\log_{10} 0.291(n + 2.6)\}^{1/2}$$

is good to about 1 part in 1500 for $20 < n < 1000$, while interpolation in Table 1 for a more
precise ratio than 17.06 will easily give an accuracy of 1 in 20,000. The alternate
approximation, which in this case agrees with the approximation just given to 4 significant figures, is

$$d_n/d_2 = 3.66\{\log_{10} 0.291(n + 2.6)\}^{1/2}.$$


The variances of the normal range, $V_n$, were taken from Table 20 of the Biometrika tables (based on Hartley & Pearson (1951)) for $n < 20$, from Tippett (1925), as repeated on p. 45 of the Biometrika tables, for $n = 60$, 100, 200, 500 and 1000, and from Pearson (1932) for $n = 30$, 45 and 75. The approximation
$$1/V_n = 1.338 \log_{10} 1.05(n + 4.4)$$








482 


Interpolations and approximations related to the normal range 


is apparently about as good as the tabulated values for $20 < n < 1000$ and is quite useful for $n > 6$. The alternate form differs by about 2 parts in 1300 and is
$$V_n/V_2 = 1.03/\log_{10} 1.05(n + 4.4).$$


Table 1. Ratios of various quantities to $\log_{10}(b(n + c))$ for selected values of $n$



































    Quantity  [w_5%]^2  [w_1%]^2  d_n^2    d_n/√V_n  d_n/√V_n  1/V_n    [(w/s)_5%]^2  [(w/s)_1%]^2
    b         1.5       3.3       0.291    0.52      0.494     1.05     0.78          1.1
    c         0.0       0.0       2.6      2.5       1.535√n   4.4      −4.5          −7.4

    n = 2     16.10     16.17     10.0     3.58      6.84      1.66     Imag.         Imag.
        3     16.82     17.05     13.5     4.18      4.900     1.423    Imag.         Imag.
        4     16.9      17.28     15.0     4.42      4.8340    1.366    Imag.         Imag.
        5     17.000*   17.38     15.69    4.56      4.8347    1.3440   Imag.         Imag.
        6     17.035*   17.47     16.12    4.630     4.8349    1.3343   133.          Imag.
        8     17.037*   17.52     16.57    4.711     4.8351    1.3348   39.           86.
       10     17.026*   17.53     16.78    4.7497    4.8351    1.3340   21.           32.
       15     17.041*   17.53     16.99    4.7871    4.8353    1.3359   19.04         21.3
       20     17.006*   17.54     17.054   4.7984    4.8352    1.3370   18.62         20.10
       30     17.06     17.50     17.082   4.8034    4.8345    1.3378   (18.41)       (19.73)
       45     16.96     17.47     17.079   4.8029    4.8341    1.3382   n/a           n/a
       60     16.98     17.50     17.070   4.8015    4.8343    1.3382   18.49         19.95
       75     17.03     17.49     17.062   4.8001    4.8336    1.3377   n/a           n/a
      100     16.99†    17.45‡    17.052   4.8008    4.8355    1.3393   18.59         20.14
      200     17.00     17.43     17.036   4.7979    4.8345    1.3388   18.64         20.17
      500     17.04     17.48     17.044   4.7951    4.8317    1.3370   18.62         20.14
     1000     17.10     17.49     17.056   4.8008    4.8358    1.3396   18.59         20.03

    std       17.0      17.5      17.06    4.8       4.835     1.3380   18.6          20.1
    alt       17.5      17.52     17.06    n/a       n/a       1.336    18.49         20.25

* Based on interpolation in the cumulative distribution to a 3rd decimal.
† 17.04 by method used for n > 100.
‡ 17.40 by method used for n > 100.
Imag. = imaginary, n/a = no percentage point tabulated (or, in the 'alt' row, no alternate form provided).
Values in ( ) based on interpolations by Pearson (1932).


Two sorts of approximations to the reciprocal of the coefficient of variation of the range are given. The first yields
$$d_n/\sqrt{V_n} = 4.8 \log_{10} 0.52(n + 2.5),$$
and is accurate to the accuracy of the known values or to 1 part in 1000 for $n > 20$. The second, which illustrates an altered form for $c$, yields
$$d_n/\sqrt{V_n} = 4.835 \log_{10} 0.494(n + 1.535\sqrt{n}),$$
which is accurate to the accuracy of the known values or to 1 part in 4000 for $n > 4$. (No alternatives are provided.)







The percentage points of the ratio of a range to the standard-deviation estimate, $s$, from the same sample were taken from the recent paper by David et al. (1954). The approximations
$$(w/s)_{5\%} = \{18.6 \log_{10} 0.78(n - 4.5)\}^{1/2},$$
$$(w/s)_{1\%} = \{20.1 \log_{10} 1.1(n - 7.4)\}^{1/2},$$
are not quite as good as the key values, the ratios in Table 1 showing a tendency to run low near $n = 50$, and, probably, high near $n = 300$. The amount of this effect is near the limit of accuracy, and should not amount to more than about 1 part in 400 for $20 < n < 1000$. About as much more error is introduced by two-significant-figure constants in the alternate forms:
$$(w/s)_{5\%} = 4.3\{\log_{10} 0.78(n - 4.5)\}^{1/2},$$
$$(w/s)_{1\%} = 4.5\{\log_{10} 1.1(n - 7.4)\}^{1/2}.$$


METHODS 


The approximations given are not best in any specific sense, but have been chosen to fit reasonably well. Except for $w_{5\%}$ and $w_{1\%}$, where $c = 0$ works quite well, the values of $b$ and $c$ were usually determined to make the ratios for $n = 20$, 100 and 1000 nearly the same, which can be easily done by trial and error solution of
$$\log_{10}(100 + c) - \frac{\log_{10}(1000 + c) - \log_{10}(100 + c)}{(x_{1000}/x_{100}) - 1} = \log_{10}(20 + c) - \frac{\log_{10}(1000 + c) - \log_{10}(20 + c)}{(x_{1000}/x_{20}) - 1},$$
the common value of these differences being taken as $-\log_{10} b$.
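The trial-and-error solution described above can be mechanised with a simple bisection on $c$. The sketch below is only an illustration: the function name `fit_b_c`, the bracketing interval and the tolerance are assumptions, and the three input values are generated from the $d_n$ approximation itself (17.06, 0.291, 2.6) so that the routine can be seen to recover constants that are known in advance, rather than being taken from any table.

```python
import math

def fit_b_c(x20, x100, x1000, c_lo=-10.0, c_hi=10.0, tol=1e-12):
    """Find b, c making x_n / log10(b (n + c)) equal at n = 20, 100, 1000.

    x20, x100, x1000 are the values of the quantity being fitted.
    The bracket [c_lo, c_hi] is an assumption; it must contain a sign change.
    """
    def minus_log_b(c, n, x_n):
        # log10(n + c) - [log10(1000 + c) - log10(n + c)] / (x_1000 / x_n - 1)
        spread = math.log10(1000 + c) - math.log10(n + c)
        return math.log10(n + c) - spread / (x1000 / x_n - 1)

    def g(c):
        # Difference of the two sides of the equation in the text.
        return minus_log_b(c, 100, x100) - minus_log_b(c, 20, x20)

    lo, hi = c_lo, c_hi
    g_lo = g(lo)
    if g_lo * g(hi) > 0:
        raise ValueError("no sign change in the bracket for c")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        g_mid = g(mid)
        if g_lo * g_mid <= 0:
            hi = mid
        else:
            lo, g_lo = mid, g_mid
    c = 0.5 * (lo + hi)
    b = 10 ** (-minus_log_b(c, 100, x100))
    ratio = x100 / math.log10(b * (100 + c))
    return b, c, ratio

def synthetic(n, ratio=17.06, b=0.291, c=2.6):
    """Synthetic test quantity built from the d_n^2 approximation itself."""
    return ratio * math.log10(b * (n + c))

if __name__ == "__main__":
    print(fit_b_c(synthetic(20), synthetic(100), synthetic(1000)))
```

With real tabulated values the three ratios will agree only approximately at other $n$, which is the situation Table 1 documents.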


ILLUSTRATION OF APPROXIMATION 


To illustrate the use of the technique in more refined approximation, Fig. 1 shows values of
$$\operatorname{antilog}_{10}\left(d_n/4.835\sqrt{V_n}\right) - 0.496n - 0.6\log_{10} n$$
plotted against $\log_{10} n$. For larger values of $n$ the dashed vertical lines indicate the uncertainties associated with the fact that $V_n$ is given to only 3 decimals for $n = 60$, 100, 200, 500 and 1000, and to only 4 decimals for 30, 45 and 75. Interpolation to nearly 4-decimal accuracy in $V_n$ is clearly possible from $n = 20$ nearly to $n = 100$. Better values for $n$'s between 100 and 1000 would clearly be obtained from one or two 4 or 5 decimal values of $V_n$ rather than by many more values of lower accuracy. A single highly accurate quadrature for an $n$ of, say, 1001 would probably go far to settle this situation.

The quantity plotted can be regarded as
$$0.496c' - 0.6\log_{10} n,$$
where
$$d_n/\sqrt{V_n} = 4.835 \log_{10} 0.496(n + c').$$
Here 4.835 was chosen from the second approximation to $d_n/\sqrt{V_n}$, while 0.496 was modified from 0.494 to give a relatively straight plot for $n$ between 10 and 100. The use of $0.6\log_{10} n$ is not essential, and serves merely to simplify the plot somewhat.













A further use of the same technique is in supplementing the available values of the lower percentage points of $w/s$. The lower 2.5 % point serves as a good example. Fitting to $n = 10$, 100 and 1000 yields
$$3.907\{\log_{10}[(n + 11.42)/7.79]\}^{1/2}$$
as a first approximation. Keeping the 3.907, and adjusting the 7.79 to 8.14 to produce reasonable linearity against $\log n$, leads to the use of
$$z = 8.14 \operatorname{antilog}_{10}(y/3.907)^2 - n - 4.6\log_{10} n,$$
as a quantity of reasonable constancy, where $y$ is the lower 2.5 % point of $w/s$.


























[Figure: residuals plotted against sample size, n, on a logarithmic scale from 2 to 1000.]

Fig. 1. Residuals for $d_n/\sqrt{V_n}$, where $c'$ is defined by $d_n/\sqrt{V_n} = 4.835 \log_{10} 0.496(n + c')$.
The numerical values found are as follows:

    n    2      3      10     15     20     60     100    200    500    1000
    z    7.63   7.69   7.79   8.00   8.70   7.97   7.28   7.87   7.89   43.42

The values for $n = 2$ and 3 are taken from Thomson (1955). There appears to be some ground for suspicion that the values of $y$ for 20 and 1000 are 0.02 or 0.03 high. Interpolation for $n = 4$ to 9 should be simple and effective.


While the original approximation (to $w_{5\%}$) was obtained entirely empirically, the writer owes the suggestion of its asymptotic explanation to his former colleague, David L. Wallace.













REFERENCES 


David, H. A., Hartley, H. O. & Pearson, E. S. (1954). Biometrika, 41, 482-93.
Elfving, G. (1947). Biometrika, 34, 111-19.
Hartley, H. O. & Pearson, E. S. (1951). Biometrika, 38, 463.
Pearson, E. S. (1932). Biometrika, 24, 404-17.
Pearson, E. S. (1942). Biometrika, 32, 301-8.
Pearson, E. S. & Merrington, M. (1951). Biometrika, 38, 4-10.
Thomson, G. W. (1955). Biometrika, 42, 268-9.
Tippett, L. H. C. (1925). Biometrika, 17, 364-87.











[ 486 ] 


THE GAMBLER’S RUIN PROBLEM WITH CORRELATION 


By C. MOHAN 
London School of Economics and Columbia University, New York 


1. INTRODUCTION 


In the classical problem of the gambler's ruin there is a constant probability of the gambler winning any game and the games are independent. Feller, for example, has treated this problem in his book (1950). We consider here a more general problem where there is a correlation between the results of two successive games. We consider two players X, Y. Conditional on winning the previous game, the probabilities that X wins or loses the next game are $p_1$, $q_1$, and conditional on losing the previous game these probabilities are $q_2$, $p_2$. Thus X's ability to win or lose is governed by the transition probability matrix (t.p.m.):

$$\begin{array}{c|cc} & \text{win} & \text{loss} \\ \hline \text{win} & p_1 & q_1 \\ \text{loss} & q_2 & p_2 \end{array} \tag{1.1}$$

where
$$p_1 + q_1 = 1 = p_2 + q_2. \tag{1.2}$$


If we suppose that the stake in each game is one coin, two successive games can have for X one of the following pairs of results, the transition probabilities being given before each pair:
$$p_1: (+1, +1), \qquad q_1: (+1, -1), \qquad q_2: (-1, +1), \qquad p_2: (-1, -1).$$

On the assumption that a score of +1 has the same probability as a score of −1 in the first game, the coefficient of correlation is
$$R = \frac{p_1 - q_2}{\sqrt{(p_1 + q_2)(p_2 + q_1)}}. \tag{1.3}$$





When $p_1 = p_2 = p$ (so that also $q_1 = q_2 = 1 - p = q$), the correlation becomes $R = p - q$. This corresponds to a real situation, as, for example, a game between two players, of whom the one holding the 'bank' has an advantage, the bank passing to the winner of each game. We suppose that the probability of winning with the bank is $p$, and $p > q$, so that $R > 0$. On the other hand, if the rule were that the bank passes to the loser of each game, we should have a negative correlation $R$.

Suppose that X starts with a capital $r$ and Y with a capital $a - r$, and the games are played with a certain (positive or negative) correlation $R$, until one or other of the players is ruined. How is the probability of ruin affected by $R$? In particular, if X had the choice of determining whether the games were to be played with a positive or negative correlation, on what basis should he decide so as to maximize his chances of winning?

This problem is analogous to a correlated random walk in the presence of absorbing barriers at $n = 0$ and at $n = a$, the particle starting from a position $r$ ($0 < r < a$) and moving a unit step to the right or to the left according to the t.p.m. (1.1).

The random walk is said to be symmetric if $p_1 = p_2 = p$. Then at the end of each step there is a probability $p$ of a step being in the same direction as the preceding one and a







probability $q$ ($= 1 - p$) of the direction being reversed. We shall consider first the problem for the general unsymmetric case corresponding to the t.p.m. (1.1).

In case the correlation is zero (i.e. $p_1 = q_2 = p$, so that $q_1 = p_2 = 1 - p$), the steps are to the right with a constant probability $p$ and to the left with a constant probability $1 - p$; thus there is a drift towards the right which is measured by $2p - 1$. When $p = \tfrac12$ the random walk is both symmetric and uncorrelated.

In the case of unsymmetrical correlated walk, the drift may still be measured by the quantity
$$\delta = p_1 - p_2.$$
Introducing the quantity
$$c = p_1 + p_2 - 1, \tag{1.4}$$
the coefficient of correlation $R$ given by (1.3) is now given in terms of $c$ and $\delta$ by the relation
$$R^2(1 - \delta^2) = c^2. \tag{1.5}$$
It will be noticed that in the symmetrical case ($\delta = 0$) $c = R$.
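The relations (1.3) to (1.5) can be checked numerically by enumerating the four possible pairs of scores in two successive games, with the first score taken as +1 or −1 with equal probability as assumed in the text. The following sketch does this; the function names are illustrative and not from the paper.

```python
import math

def correlation_formula(p1, p2):
    """Coefficient of correlation R as given by (1.3)."""
    q1, q2 = 1 - p1, 1 - p2
    return (p1 - q2) / math.sqrt((p1 + q2) * (p2 + q1))

def correlation_by_enumeration(p1, p2):
    """R computed directly from the joint distribution of two successive scores."""
    q1, q2 = 1 - p1, 1 - p2
    # First score is +1 or -1 with probability 1/2 each; second follows the t.p.m.
    joint = {
        (+1, +1): 0.5 * p1,
        (+1, -1): 0.5 * q1,
        (-1, +1): 0.5 * q2,
        (-1, -1): 0.5 * p2,
    }
    ex = sum(pr * x for (x, _), pr in joint.items())
    ey = sum(pr * y for (_, y), pr in joint.items())
    exy = sum(pr * x * y for (x, y), pr in joint.items())
    var_x = sum(pr * x * x for (x, _), pr in joint.items()) - ex ** 2
    var_y = sum(pr * y * y for (_, y), pr in joint.items()) - ey ** 2
    return (exy - ex * ey) / math.sqrt(var_x * var_y)

def drift_and_c(p1, p2):
    """The quantities delta and c of (1.4)."""
    return p1 - p2, p1 + p2 - 1

if __name__ == "__main__":
    p1, p2 = 0.7, 0.4
    R = correlation_formula(p1, p2)
    delta, c = drift_and_c(p1, p2)
    print(R, correlation_by_enumeration(p1, p2), R * R * (1 - delta ** 2), c * c)
```

In the symmetrical case $p_1 = p_2 = p$ the same functions return $R = p - q$, in line with the remark following (1.5).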


2. PROBABILITY OF ULTIMATE RUIN 


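Denote by $a_r$ and $b_r$ the probabilities of ultimate ruin of X (whose initial capital is $r$) conditional on the first game being a win or a loss. It is easy to see that $a_r$ and $b_r$ satisfy the recurrence relations
$$a_r = p_1 a_{r+1} + q_1 b_{r+1}, \qquad b_{r+1} = p_2 b_r + q_2 a_r \qquad (0 < r < a - 1). \tag{2.1}$$
Denoting by $E$ the operation of increasing the suffix $r$ by unity, these may be written as
$$(1 - p_1 E)a_r = q_1 E b_r,$$
$$(E - p_2)b_r = q_2 a_r,$$
which show that $a_r$ and $b_r$ satisfy the difference equation
$$(E - p_2)(1 - p_1 E)X_r = q_1 q_2 E X_r,$$
i.e.
$$\left(E^2 - \frac{p_1 + p_2}{p_1}E + \frac{p_2}{p_1}\right)X_r = 0.$$
Hence
$$b_r = G + H\left(\frac{p_2}{p_1}\right)^r,$$
$$a_r = \frac{1}{q_2}\left[G + H\left(\frac{p_2}{p_1}\right)^{r+1} - p_2\left\{G + H\left(\frac{p_2}{p_1}\right)^r\right\}\right] = G + \frac{q_1}{q_2}H\left(\frac{p_2}{p_1}\right)^{r+1}.$$
The boundary values are $b_1 = 1$ and $a_{a-1} = 0$; whence
$$G + H\left(\frac{p_2}{p_1}\right) = 1, \qquad q_2 G + q_1\left(\frac{p_2}{p_1}\right)^a H = 0,$$
so that
$$q_1 p_2^a/G = -q_2 p_1^a/H = p_2(q_1 p_2^{a-1} - q_2 p_1^{a-1}).$$
Substituting these values of $G$ and $H$ we are led to the probabilities of ultimate ruin
$$b_r = \frac{q_1\lambda^{a-1} - q_2\lambda^{r-1}}{q_1\lambda^{a-1} - q_2} \qquad\text{and}\qquad a_r = \frac{q_1(\lambda^{a-1} - \lambda^r)}{q_1\lambda^{a-1} - q_2}, \tag{2.2}$$
where $\lambda = p_2/p_1$.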




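Now if $c_1$ denotes the probability of initial win and $c_2$ ($= 1 - c_1$) of initial loss, then the general expression for $P_r$, the probability of ultimate ruin unconditional on the result of the first game, is
$$P_r = c_1 a_r + c_2 b_r = \frac{c_1 q_1(\lambda^{a-1} - \lambda^r) + c_2(q_1\lambda^{a-1} - q_2\lambda^{r-1})}{q_1\lambda^{a-1} - q_2}. \tag{2.3}$$
For $\lambda = 1$, (2.3) is indeterminate. In this symmetrical case the equations (2.1) reduce to
$$a_r = p a_{r+1} + q b_{r+1} \qquad\text{and}\qquad b_{r+1} = p b_r + q a_r.$$
Hence, it is found that $P_r$ is now given by
$$P_r = \frac{c_2 + (a - 1 - r)q}{1 + q(a - 2)}. \tag{2.4}$$
$P_r$ can be expressed in terms of $\delta$ and $c$, since
$$\lambda = (c + 1 - \delta)/(c + 1 + \delta) \qquad\text{and}\qquad q_2/q_1 = (1 - c + \delta)/(1 - c - \delta).$$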
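A numerical check of the ruin probabilities is sketched below. It codes the closed forms (2.2) for $\lambda \neq 1$ and, for the symmetrical case, a split of (2.4) into its two conditional parts; that split is derived here for illustration (it reproduces (2.4) when weighted by $c_1$ and $c_2$) and is not quoted from the paper. Function names and the tolerance used to detect $p_1 = p_2$ are assumptions.

```python
def ruin_probabilities(r, a, p1, p2):
    """Return (a_r, b_r): ruin probabilities given first game won / lost.

    Requires 1 <= r <= a - 1 and p1 > 0.
    """
    q1, q2 = 1 - p1, 1 - p2
    if abs(p1 - p2) < 1e-12:
        # Symmetrical case (lambda = 1): linear solution of (2.1),
        # derived for this sketch and consistent with (2.4).
        q = q1
        denom = 1 + q * (a - 2)
        return q * (a - 1 - r) / denom, (1 + q * (a - 1 - r)) / denom
    lam = p2 / p1
    denom = q1 * lam ** (a - 1) - q2
    a_r = q1 * (lam ** (a - 1) - lam ** r) / denom
    b_r = (q1 * lam ** (a - 1) - q2 * lam ** (r - 1)) / denom
    return a_r, b_r

def unconditional_ruin(r, a, p1, p2, c1=0.5):
    """P_r = c1 a_r + c2 b_r with c2 = 1 - c1, as in (2.3)."""
    a_r, b_r = ruin_probabilities(r, a, p1, p2)
    return c1 * a_r + (1 - c1) * b_r

def max_recurrence_error(a, p1, p2):
    """Largest violation of the recurrences (2.1) over r = 1, ..., a - 2."""
    q1, q2 = 1 - p1, 1 - p2
    worst = 0.0
    for r in range(1, a - 1):
        a_r, b_r = ruin_probabilities(r, a, p1, p2)
        a_next, b_next = ruin_probabilities(r + 1, a, p1, p2)
        worst = max(worst,
                    abs(a_r - (p1 * a_next + q1 * b_next)),
                    abs(b_next - (p2 * b_r + q2 * a_r)))
    return worst

if __name__ == "__main__":
    print(max_recurrence_error(12, 0.6, 0.7))
    print([round(unconditional_ruin(r, 12, 0.5, 0.5), 4) for r in (3, 6, 9)])
```

For $p_1 = p_2 = \tfrac12$ and $c_1 = c_2 = \tfrac12$ the routine returns $1 - r/a$, the classical uncorrelated result.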


As the formula in its general form is not easy to interpret, we take for illustration the symmetrical case
$$p_1 = p_2 = p \quad\text{with}\quad c_1 = c_2 = \tfrac12, \tag{2.5}$$
and study the behaviour of $P_r$ as the correlation $R$ varies from −1 to +1. Now
$$R = p - q = 1 - 2q = c.$$
Let $\rho = 2R/(1 - R)$, so that $q = \tfrac12(1 - R) = 1/(\rho + 2)$. Then from (2.4)
$$P_r = \frac{1 - \dfrac{r}{a} + \dfrac{\rho}{2a}}{1 + \dfrac{\rho}{a}}. \tag{2.6}$$
Now $\rho$ varies between −1 and $+\infty$ as $R$ varies from −1 to +1, and $a > r > 0$. Hence $P_r$ lies between $\tfrac12$ and $1 - (r - \tfrac12)/(a - 1)$. Thus $P_r \gtreqless \tfrac12$ according as $r \lesseqgtr \tfrac12 a$. It will be noticed that while $\lim_{R \to 1} P_r = \tfrac12$, $\lim_{R \to -1} P_r$ is different in the three cases $r \lesseqgtr \tfrac12 a$. The situation is shown below in the graphs of equation (2.6) for three typical cases. Each curve cuts off from the vertical axis an intercept $(1 - r/a)$; as $R \to +1$, $P_r \to \tfrac12$; when $R \to -1$, $P_r \to 1 - (r - \tfrac12)/(a - 1)$. These results are easy to interpret, for when $R \to 1$ the probability of ultimate ruin depends less and less on the initial capital and should more and more approach the probability of losing the very first game; this is shown by the three curves approaching one another as $R \to 1$. The value of $P_r$ for $R = -1$ is actually non-existent, as is borne out by a study of the equations (2.1) which for $R = -1$ degenerate into a single set of equations
$$a_r = b_{r+1} \qquad (1 \le r \le a - 2),$$
with the boundary values $a_{a-1} = 0$, $b_1 = 1$. These $(a - 2)$ equations are not soluble; in fact, these show that $a_1, a_2, \ldots, a_{a-2}$ are respectively equal to $b_2, b_3, \ldots, b_{a-1}$ and none of them involves either of the unknown probabilities $a_{a-1}$, $b_1$. On the other hand, when $R = 1$, equations (2.1) become
$$a_r = a_{r+1}, \qquad b_r = b_{r+1} \qquad (1 \le r \le a - 2),$$











with the boundary values $b_1 = 1$, $a_{a-1} = 0$. Hence
$$a_1 = a_2 = \ldots = a_{a-1} = 0 \qquad\text{and}\qquad b_1 = b_2 = \ldots = b_{a-1} = 1,$$
so that $P_r = c_1 a_r + c_2 b_r = c_2$ ($= \tfrac12$ for the graph).

We conclude that if X's capital is initially less than Y's, it is advantageous for him to choose the positively correlated play, and if initially greater than Y's, then to choose negatively correlated play. In each case the greater the correlation in magnitude the more it is to be preferred.

















[Figure: curves of $P_r$ against $R$ from −1 to 1 for $r = 3$, 6 and 9, with $a = 12$.]

Fig. 1. The graphs show the probability of ruin, $P_r$, as a function of the correlation $R$ in a game in which the total capital is 12 for three cases of initial capitals of $r = 3$, 6 and 9.


3. EXPECTED DURATION OF PLAY 


The expected value of the duration of play can also be determined by the use of difference equations. The method is justified if we assume that these expectations are finite. Let $A_r$, $B_r$ be the respective expected durations of play if the initial capital is $r$ and the first game results in a win or loss.

Then if the first game results in a win the capital becomes $r + 1$ and then the expected duration is $A_{r+1}$ with probability $p_1$ and $B_{r+1}$ with probability $q_1$. Hence
$$A_r = p_1 A_{r+1} + q_1 B_{r+1} + 1,$$
and similarly
$$B_{r+1} = p_2 B_r + q_2 A_r + 1 \qquad (0 < r < a - 1). \tag{3.1}$$
The boundary values are
$$B_1 = 1 = A_{a-1}. \tag{3.2}$$
Equations (3.1) become homogeneous on putting
$$A_r = Kr + L + \phi_r, \qquad B_r = Mr + N + \psi_r, \tag{3.3}$$
if
$$Kr + L = p_1(Kr + K + L) + q_1(Mr + M + N) + 1,$$
$$Mr + M + N = p_2(Mr + N) + q_2(Kr + L) + 1;$$
i.e. if
$$K = p_1 K + q_1 M, \qquad L = p_1(K + L) + q_1(M + N) + 1,$$
$$M = p_2 M + q_2 K, \qquad M + N = p_2 N + q_2 L + 1.$$

Remembering the relation (1.2), viz. $p_1 + q_1 = 1 = p_2 + q_2$, these equations are solved by
$$K = M, \qquad q_1(L - N) = M + 1 \qquad\text{and}\qquad q_2(L - N) = M - 1,$$
i.e.
$$K = M = (q_1 + q_2)/(q_1 - q_2) \qquad\text{and}\qquad L - N = 2/(q_1 - q_2). \tag{3.4}$$
This solution shows that the homogeneity of (3.1) achieved by the substitution (3.3) is not disturbed if the same constant is added to both the relations (3.3). This conclusion is corroborated by a study of the structure of equations (3.1); for if $A_r$ and $B_r$ are increased by $k$ (a constant with regard to $r$) then by virtue of the relation (1.2) $k$ cancels out from them. Consequently one of the constants $L$ and $N$ can be chosen arbitrarily. We therefore make the substitution
$$A_r = Kr + \phi_r, \qquad B_r = Mr + N + \psi_r,$$
i.e. set $L = 0$ in (3.3). This substitution transforms (3.1) into
$$\phi_r = p_1\phi_{r+1} + q_1\psi_{r+1}, \qquad \psi_{r+1} = p_2\psi_r + q_2\phi_r.$$
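Proceeding as in § 2 the solution of these equations is found to be
$$\psi_r = A + B\left(\frac{p_2}{p_1}\right)^r, \qquad \phi_r = A + \frac{q_1}{q_2}B\left(\frac{p_2}{p_1}\right)^{r+1},$$
so that
$$A_r = \frac{q_1 + q_2}{q_1 - q_2}\,r + A + \frac{q_1}{q_2}B\left(\frac{p_2}{p_1}\right)^{r+1}$$
and
$$B_r = \frac{q_1 + q_2}{q_1 - q_2}\,r - \frac{2}{q_1 - q_2} + A + B\left(\frac{p_2}{p_1}\right)^r.$$
The boundary values (3.2) give
$$\frac{q_1 + q_2}{q_1 - q_2}(a - 1) + A + \frac{q_1}{q_2}B\left(\frac{p_2}{p_1}\right)^a = 1 = \frac{q_1 + q_2 - 2}{q_1 - q_2} + A + B\,\frac{p_2}{p_1},$$
whence, with $\lambda = p_2/p_1$,
$$B = \frac{q_2\{(a - 2)(q_1 + q_2) + 2\}}{q_1\lambda\,(q_2/q_1 - \lambda^{a-1})(q_1 - q_2)}, \qquad A = \frac{2p_2}{q_1 - q_2} - B\lambda.$$
Thus
$$A_r = \frac{(q_1 + q_2)r + 2p_2}{q_1 - q_2} + \frac{(a - 2)(q_1 + q_2) + 2}{q_1 - q_2}\cdot\frac{\lambda^r - q_2/q_1}{q_2/q_1 - \lambda^{a-1}}$$
and
$$B_r = \frac{(q_1 + q_2)r - 2q_2}{q_1 - q_2} + \frac{\{(a - 2)(q_1 + q_2) + 2\}q_2}{(q_1 - q_2)q_1}\cdot\frac{\lambda^{r-1} - 1}{q_2/q_1 - \lambda^{a-1}}. \tag{3.5}$$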


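We now consider the symmetrical case $q_1 = q_2$. The relations (3.4) are indeterminate in this case. We therefore try the quadratic expressions
$$A_r = K_1 + L_1 r + M_1 r^2, \qquad B_r = K_2 + L_2 r + M_2 r^2.$$
Substituting these expressions in
$$A_r = pA_{r+1} + qB_{r+1} + 1, \qquad B_{r+1} = pB_r + qA_r + 1, \tag{3.6}$$
and identifying the various powers of $r$,
$$K_1 = p(K_1 + L_1 + M_1) + q(K_2 + L_2 + M_2) + 1,$$
$$L_1 = p(L_1 + 2M_1) + q(L_2 + 2M_2), \qquad M_1 = pM_1 + qM_2,$$
$$K_2 + L_2 + M_2 = qK_1 + pK_2 + 1, \qquad L_2 + 2M_2 = qL_1 + pL_2, \qquad M_2 = qM_1 + pM_2.$$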





The third and sixth give $M_1 = M_2$ and the other four are
$$q(K_2 - K_1) + pL_1 + qL_2 + M_1 + 1 = 0, \tag{3.7}$$
$$q(L_2 - L_1) + 2M_1 = 0, \tag{3.8}$$
$$q(K_1 - K_2) - L_2 - M_1 + 1 = 0, \tag{3.9}$$
and one identical with (3.8).
Adding (3.7) and (3.9)
$$p(L_1 - L_2) + 2 = 0, \tag{3.10}$$
so that by (3.8)
$$M_1 = \tfrac12 q(L_1 - L_2) = -q/p,$$
and finally by (3.9)
$$q(K_1 - K_2) = L_2 + M_1 - 1 = L_1 + 1/p \quad\text{by (3.10)}.$$
Thus
$$A_r = K_1 + r\{q(K_1 - K_2) - 1/p\} - \frac{q}{p}r^2,$$
$$B_r = K_2 + r\{q(K_1 - K_2) + 1/p\} - \frac{q}{p}r^2. \tag{3.11}$$
The initial conditions (3.2) give
$$K_1 + (a - 1)\{q(K_1 - K_2) - 1/p\} - (q/p)(a - 1)^2 = 1 = K_2 + q(K_1 - K_2) + 1,$$
which are solved by
$$K_1 = a \qquad\text{and}\qquad K_2 = -qa/p.$$
Thus (3.11) become
$$A_r = a + (r/p)(aq - 1) - r^2 q/p,$$
and
$$B_r = -(a + r^2)q/p + (aq + 1)r/p. \tag{3.12}$$

Hence the expected duration of play unconditional upon the result of the first game is
$$D_r = c_1\left\{(a - r) + \frac{q}{p}\,r(a - r - 1)\right\} + c_2\left\{r + \frac{q}{p}(r - 1)(a - r)\right\}, \tag{3.13}$$
where $c_1$, $c_2$ are, as already defined, the initial probabilities of win and loss. Assuming the first win and loss to be equally probable (3.13) becomes
$$D_r = r(a - r) - \frac{R}{1 + R}\{-a + 2r(a - r)\}. \tag{3.14}$$
When $R = 0$, $D_r = r(a - r)$, in agreement with Feller's formula (1950, p. 287). Further, since $r(a - r) \ge a - 1 \ge \tfrac12 a$ (as $a \ge 2$), it follows that except in the trivial case $\tfrac12 a = 1 = r$, for positively correlated play the duration has smaller expectation than for the uncorrelated case and for negatively correlated play it has larger expectation. In fact as $R \to 1$, $D_r \to \tfrac12 a$ and as $R \to -1$, $D_r \to \infty$.
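The symmetric-case results can be verified numerically. The sketch below codes (3.12) and (3.14), then checks that the conditional durations satisfy the recurrences (3.6) and the boundary values (3.2), and that their average agrees with (3.14) when $R = p - q$. The function names are illustrative assumptions.

```python
def conditional_durations(r, a, p):
    """(A_r, B_r) from (3.12): expected durations given first game won / lost."""
    q = 1 - p
    A = a + (r / p) * (a * q - 1) - r * r * q / p
    B = -(a + r * r) * q / p + (a * q + 1) * r / p
    return A, B

def expected_duration(r, a, R):
    """D_r from (3.14), with first win and loss equally probable."""
    return r * (a - r) - R / (1 + R) * (-a + 2 * r * (a - r))

def max_recurrence_error(a, p):
    """Largest violation of (3.6) over r = 1, ..., a - 2."""
    q = 1 - p
    worst = 0.0
    for r in range(1, a - 1):
        A_r, B_r = conditional_durations(r, a, p)
        A_next, B_next = conditional_durations(r + 1, a, p)
        worst = max(worst,
                    abs(A_r - (p * A_next + q * B_next + 1)),
                    abs(B_next - (p * B_r + q * A_r + 1)))
    return worst

if __name__ == "__main__":
    a, p = 12, 0.7
    R = 2 * p - 1  # R = p - q in the symmetrical case
    for r in (3, 6, 9):
        A_r, B_r = conditional_durations(r, a, p)
        print(r, 0.5 * (A_r + B_r), expected_duration(r, a, R))
```

Running it with a positive $R$ gives durations below $r(a - r)$, and with a negative $R$ durations above it, as stated in the text.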


4. GENERATING FUNCTIONS FOR THE PROBABILITIES OF THE GAMBLER'S RUIN AT THE nth GAME


Denote by $v_{r,n}$ the conditional probability that X (whose initial capital is $r$) is ruined at the $n$th game, given that he wins the first game and by $w_{r,n}$ the corresponding probability when he loses the first game.

Evidently
$$v_{r,n} = 0 \text{ if } r > n - 2 \text{ or if } n - r \text{ is odd},$$
and
$$w_{r,n} = 0 \text{ if } r > n \text{ or if } n - r \text{ is odd}. \tag{4.1}$$







Define the generating functions (g.f.) of $v_{r,n}$, $w_{r,n}$:
$$V_r(s) = \sum_{n=1}^{\infty} v_{r,n}s^n, \qquad W_r(s) = \sum_{n=1}^{\infty} w_{r,n}s^n. \tag{4.2}$$
Elementary reasoning shows that the following recurrence relations hold:
$$v_{r,n+1} = p_1 v_{r+1,n} + q_1 w_{r+1,n}, \qquad w_{r+1,n+1} = p_2 w_{r,n} + q_2 v_{r,n} \qquad (1 < r < a - 1,\ n \ge 1). \tag{4.3}$$
Evidently
$$w_{1,1} = 1, \qquad v_{a-1,n} = 0, \quad n \ge 1. \tag{4.4}$$
Define the boundary value
$$w_{1,n} = 0 \quad (n > 1) \tag{4.5}$$
in order that (4.3) may hold for $r = 1$ also. Now multiply the equations (4.3) by $s^{n+1}$ and add from $n = 1$ to $\infty$; then since $v_{r,1} = 0 = w_{r+1,1}$, $r \ge 1$, we have
$$V_r(s) = p_1 sV_{r+1}(s) + q_1 sW_{r+1}(s),$$
and
$$W_{r+1}(s) = p_2 sW_r(s) + q_2 sV_r(s). \tag{4.6}$$
Using the operator $E$ already defined these may be written as
$$(1 - p_1 sE)V_r(s) = q_1 sEW_r(s),$$
$$(E - p_2 s)W_r(s) = q_2 sV_r(s),$$
which show that $V_r(s)$ and $W_r(s)$ satisfy the difference equation
$$(E - p_2 s)(1 - p_1 sE)X_r = q_1 q_2 s^2 EX_r,$$
i.e.
$$\left(E^2 - \frac{1 + cs^2}{p_1 s}E + \frac{p_2}{p_1}\right)X_r = 0.$$
Hence
$$W_r(s) = A\eta_1^r + B\eta_2^r,$$
so that by (4.6)
$$V_r(s) = (q_2 s)^{-1}\{A\eta_1^{r+1} + B\eta_2^{r+1} - p_2 s(A\eta_1^r + B\eta_2^r)\}, \tag{4.7}$$
where $\eta_1$, $\eta_2$ are the roots of the quadratic
$$\eta^2 - \frac{1 + cs^2}{p_1 s}\,\eta + \frac{p_2}{p_1} = 0.$$
Since (4.6) are homogeneous in $V_r(s)$, $W_r(s)$ and hold for $r = 1, 2, \ldots, a - 2$, they can determine $V_r(s)$ and $W_r(s)$ for $r = 1, 2, \ldots, a - 1$ if two of these $2(a - 1)$ functions are known. These two known functions are $V_{a-1}(s)$ and $W_1(s)$, since on using (4.2) the relations (4.4) and (4.5) lead to the boundary conditions
$$V_{a-1}(s) = \sum_{n=1}^{\infty} v_{a-1,n}s^n = 0, \qquad W_1(s) = \sum_{n=1}^{\infty} w_{1,n}s^n = s. \tag{4.8}$$
For these values (4.7) give
$$A\eta_1 + B\eta_2 = s, \qquad A\eta_1^{a-1}(\eta_1 - p_2 s) + B\eta_2^{a-1}(\eta_2 - p_2 s) = 0,$$
whence, writing
$$D(s) = D = \eta_1\eta_2^{a-1}(\eta_2 - p_2 s) - \eta_1^{a-1}\eta_2(\eta_1 - p_2 s),$$
we have
$$DA = s\eta_2^{a-1}(\eta_2 - p_2 s) \qquad\text{and}\qquad DB = -s\eta_1^{a-1}(\eta_1 - p_2 s).$$











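Substituting these values in (4.7) and noting that
$$(\eta_1 - p_2 s)(\eta_2 - p_2 s) = \frac{p_2}{p_1} - \frac{p_2}{p_1}\{1 + s^2(p_1 + p_2 - 1)\} + p_2^2 s^2 = \frac{p_2 q_1 q_2 s^2}{p_1},$$
we have finally
$$V_r(s) = q_1 s^2 D^{-1}(\eta_2^{a}\eta_1^{r+1} - \eta_1^{a}\eta_2^{r+1}),$$
and
$$W_r(s) = sD^{-1}\{\eta_1^r\eta_2^{a-1}(\eta_2 - p_2 s) - \eta_1^{a-1}\eta_2^r(\eta_1 - p_2 s)\}. \tag{4.9}$$
It may be noted that the probabilities $a_r$, $b_r$ of ultimate ruin calculated in the second section may be determined from these functions $V_r(s)$ and $W_r(s)$. In fact,
$$a_r = \sum_{n=r+2}^{\infty} v_{r,n} = \sum_{n=1}^{\infty} v_{r,n} = V_r(1),$$
and
$$b_r = \sum_{n=1}^{\infty} w_{r,n} = W_r(1).$$
Similarly, we have
$$A_r = V_r'(1) \qquad\text{and}\qquad B_r = W_r'(1).$$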


5. OBSERVATIONS ON THE CASE $a = \infty$

(i) Probability of ultimate ruin

Evidently when $p_2 > p_1$, i.e. when the drift $\delta$ is negative, as $a \to \infty$, $a_r$ and $b_r$ as given in (2.2) tend to unity. Even when $p_2 = p_1$, i.e. the probabilities of a win or loss being continued are equal, the formulae (2.2) show that with probability unity, X will be ruined; but when $p_2 < p_1$, then the probability of ultimate ruin, according as the first game is won or lost, is
$$\frac{q_1}{q_2}\lambda^r \qquad\text{or}\qquad \lambda^{r-1}.$$

(ii) Duration of play

(a) When $p_2 > p_1$, i.e. when the drift $\delta$ is negative, as $a \to \infty$, the last terms in (3.5) tend to zero and
$$A_r \to \frac{(q_1 + q_2)r + 2p_2}{q_1 - q_2},$$
while
$$B_r \to \frac{(q_1 + q_2)r - 2q_2}{q_1 - q_2}.$$

(b) When $p_2 < p_1$, i.e. when there is a positive drift $\delta$, as $a \to \infty$, (3.5) show that $A_r$ and $B_r \to \infty$.

(c) When $p_1 = p_2$, as $a \to \infty$, (3.12) shows that $A_r$ and $B_r \to \infty$ and hence from (3.13) $D_r \to \infty$, unless $c_1 = 0 = q$. This exceptional case occurs when with probability unity the first game is lost, as also is each succeeding game, so that X is ruined in $r$ games.


In conclusion, I wish to express my gratitude to Dr F. G. Foster, who suggested this problem to me and also supervised the investigation.


REFERENCE


FELLER, W. (1950). An Introduction to Probability Theory and its Applications. New York: Wiley 
and Sons. 





[ 494 | 


TABLES FOR SIGNIFICANCE TESTS OF 2 × 2 CONTINGENCY TABLES


By P. ARMSEN 
National Institute for Personnel Research, South African Council 
for Scientific and Industrial Research 


1. INTRODUCTION 
In a 2 × 2 table of the type

                 Y      Not Y
    X            a        c        A
    Not X        b        d        B
                 R        S        N

it is assumed that a, b, c and d are the absolute frequencies resulting from a double dichotomy of N individuals according to the properties X and Y.

The null hypothesis to be tested is:

There is no association between X and Y. If this is true, one would expect in the long run of experiments with the same marginal totals A, B, R, S, to have a : b = A : B, or which is the same, N.a = A.R.

The whole 2x2 table can—for fixed marginal totals—be determined by one of the 
entries. It is arbitrary which one would choose. The probability of the set of frequencies 
observed and of possible other sets of frequencies which might have been observed can be 
computed by the exact hypergeometric formula given by Fisher (1941, p. 95).* From this 
it follows that the possible sets of frequencies can be looked at as events with a completely 
known one-variate discrete probability distribution. It is bell-shaped or—in some cases — 
J-shaped, e.g. 






















Figure 1. 


* There are detailed discussions of the applicability of the formula in Barnard (1947) and Cochran 
(1952). 













The range of the distribution of b or d is from 0 to B, if B is the smallest marginal total. 

The present tables are based on direct evaluation of the hypergeometric formula. 

It is possible to look at the given null hypothesis in two ways: 

(1) The object is to determine experimentally whether or not there is reasonable evidence for a positive association between X and Y (or, which is the same, for a negative association between X and Not Y). In this case it is natural and generally agreed to take the observed table as significant (for departure from the null hypothesis) at, say, the 100α % level, if the sum of the probabilities of the observed and the possible more extreme tables (extreme in the same direction) is smaller than or equal to α, where, for example, α = 0.05 or 0.01. This is called 'the case of one-tailed significance'.

(2) There is no a priori knowledge about the relationship between X and Y. Here a decision must be made before starting the experiment as to which of the possible events (sets of frequencies of 2 × 2 tables) are to be taken as evidence for an existing association between X and Y (positive or negative whichever the case may be).

In this case we are looking out for alternatives to the null hypothesis which may throw the value of a frequency in the 2 × 2 table, say d, into either of the tails of the probability distribution. The usual procedure is to cut from each tail a number of terms for which the probability on the null hypothesis sums to ≤ ½α. If the observed value falls in either of these tails we say that the result is significant at the 100α % level, using a 'two-tailed' test.

In analogous problems where the probability distribution is continuous it is possible to find significance points, whether one-tailed or two-tailed, defining rejection regions associated precisely with any desired value of α. For discontinuous variation which arises when dealing with the hypergeometric type of distribution of our present problem or with the binomial or Poisson distribution, this is not the case.

The position is illustrated in Example 1 given in the following section. If we take α = 0.05, and look for significance in the sense of a positive relation between X and Y, we shall only establish this if d = i is 6 or 7. But the probability of this result is only p(6) + p(7) = 0.71 %, far below the 5 % aimed at. If we use a two-tailed test based on cutting off regions from the two ends of the distribution for each of which the sum of the probabilities is ≤ 2.5 %, then we see that this region will again only include i = 6 and 7.

Unless we adopt a procedure of random sampling for a decision, discussed, for example, by Tocher (1950) and Cochran (1952), there is no means of obtaining exactly 5 % for the rejection region. It will, however, be noted that while for the single-tailed test the region is clearly defined, for the two-tailed test alternative regions are possible if we drop the usual convention of making the probability associated with each tail region sum to ½α or less. Thus coming back to Example 1, it might be asked whether there is serious objection to including i = 0, 6 and 7 in the two-tailed 5 % rejection region, giving a total probability of 2.58 + 0.71 = 3.29 % (still under 5 %) on the null hypothesis?

The object of this paper is to consider alternative definitions of the rejection regions for the two-tailed test, and also to provide tables of 5 and 1 % significance levels for both one- and two-tailed tests, and for 2 × 2 tables containing up to N = 50 observations. These tables, by adopting a new method of entry, make it possible to go beyond the range of existing tables using relatively little space.











496 Tables for significance tests of 2 x 2 contingency tables 


2. CHOICE OF THE REJECTION REGION FOR THE TWO-TAILED TEST OF SIGNIFICANCE 


The definition of a two-tailed 100α % significance region referred to in the introductory section, which has been used largely, e.g. in Finney's (1948) tables, may be expressed as follows:

Definition $D_1$. Cut as much as possible but not more than ½α × 100 % from each tail of the distribution under the null hypothesis.

While in any particular case, when the hypergeometric probabilities have been calculated, it may seem clear on intuitive grounds how to enlarge this rejection region by including one or more further values of d and yet keep the total probability below 100α %, when preparing tables for general use it is necessary to have some clearly defined, consistent rule of procedure.

A possible procedure which the author prefers is given by the following definition which is that used in the mathematical description of the tables (pp. 502-5 below):

Definition $D_2$. Arrange the possible events (defined say in terms of d) in ascending order of the size of their probabilities under the null hypothesis; include in the 100α %, two-tailed rejection region those events for which the cumulative sum of these ordered probabilities is smaller than or equal to α. (The kind of association, positive or negative, which is likely to exist will be indicated by the observed event.)

An equivalent visual description, which can be illustrated on the diagrams given above, 
is as follows: 

Lift a parallel to the horizontal axis until the sum of the probability ordinates falling below this line has reached as closely as possible, but not exceeded, α. Each event whose associated probability ordinate is now completely below the parallel is two-tailed significant at the 100α % level. The others are not.
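The two definitions can be compared directly by computing the hypergeometric probabilities and applying each rule. The sketch below is an illustration under stated assumptions: the function names are invented here, exact rational arithmetic is used to avoid rounding at the boundary, and ties between equal probabilities (which the definition does not address) are broken by taking the smaller value of i first. The margins used are those of Examples 1 and 2 of this section (B = 7, N = 40, with S = 15 and S = 14 respectively).

```python
from fractions import Fraction
from math import comb

def null_distribution(B, S, N):
    """Hypergeometric probabilities p(i) for the cell d = i with margins fixed.

    B is the total of the row containing d, S the total of its column.
    """
    A = N - B
    total = comb(N, S)
    lo, hi = max(0, S - A), min(B, S)
    return {i: Fraction(comb(B, i) * comb(A, S - i), total)
            for i in range(lo, hi + 1)}

def region_d1(dist, alpha):
    """Cut as much as possible, but not more than alpha/2, from each tail."""
    half = Fraction(alpha) / 2
    values = sorted(dist)
    region = set()
    for order in (values, values[::-1]):  # lower tail, then upper tail
        tail = Fraction(0)
        for i in order:
            if tail + dist[i] > half:
                break
            tail += dist[i]
            region.add(i)
    return sorted(region)

def region_d2(dist, alpha):
    """Take events in ascending order of probability while the sum is <= alpha."""
    alpha = Fraction(alpha)
    region, cumulative = [], Fraction(0)
    for i in sorted(dist, key=lambda k: dist[k]):  # ties: smaller i first (assumption)
        if cumulative + dist[i] > alpha:
            break
        cumulative += dist[i]
        region.append(i)
    return sorted(region)

if __name__ == "__main__":
    example_1 = null_distribution(B=7, S=15, N=40)
    example_2 = null_distribution(B=7, S=14, N=40)
    five_percent = Fraction(5, 100)
    print(region_d1(example_1, five_percent), region_d2(example_1, five_percent))
    print(region_d1(example_2, five_percent), region_d2(example_2, five_percent))
```

Passing the level as a `Fraction` (for instance `Fraction(72, 1000)` for the 7.2 % level discussed below) keeps the comparison with the cumulative sums exact.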

The advantages of this definition $D_2$ are that:

(i) It is straightforward to apply.

(ii) It will often include more points in the rejection region than will $D_1$, without increasing the overall probability above α.

(iii) It is consistent in the sense that if an observed event is significant at the 100α₁ % level, then it must also be significant at any 100α₂ % level, where α₂ > α₁.

(iv) There is a simple way of generalization to the case of multivariate discontinuous distributions, e.g. for the h × k contingency table.

It can, however, lead to certain anomalies which, as Prof. E. S. Pearson has pointed out to me, may not be regarded as acceptable. Consider the three examples given below.

The definition $D_3$ will be given later. In Example 1, $D_2$ includes i = 0 in the rejection region which $D_1$ does not, and this seems very reasonable. For the two-tailed 5 % level in Example 2, $D_2$ again gives a larger region than $D_1$, although it is open to question whether it would not have been better to include i = 0 rather than i = 5 in the region. When, however, we take a 7.2 % level* in this example, we find a more serious difficulty. Although

p(0) = 3.53 % < ½ × 7.2 %,

$D_2$ does not make i = 0 significant, but chooses i = 5 instead. Thus in this instance $D_2$ contains no term or terms from the lower tail of the distribution.


* It should be made clear that anomalies of this kind arise only rarely and therefore we have illustrated this situation on Example 2, taking an unconventional significance level.





Example 1

    a    c  |  A                  |  33
    b    d  |  B                  |   7
    R    S  |  N          25  15  |  40

If d is replaced by a variate i and the marginal totals are kept constant, the probability
distribution of i (under the null hypothesis) is as follows:

    i             0      1      2      3      4      5      6      7
    p(i) in %    2.58  14.25  29.92  30.87  16.84   4.83   0.67   0.04

    Definition   Cases of i which are two-tailed significant at the 5 % level
    D₁                6, 7
    D₂           0,   6, 7
    D₃           0,   6, 7
Example 2

                |  33
                |   7
        26  14  |  40

    i             0      1      2      3      4      5      6      7
    p(i) in %    3.53  17.29  32.10  29.19  13.96   3.49   0.42   0.02

    Definition   Significant events (values of i)
                 Two-tailed 5 % level     Two-tailed 7.2 % level
    D₁              6, 7                  0, 6, 7
    D₂           5, 6, 7                  5, 6, 7
    D₃           0, 6, 7                  0, 6, 7


























498 Tables for significance tests of 2 x 2 contingency tables 
Example 3

                |  31
                |  12
        31  12  |  43

    i            0      1      2    3    4    5    6      7      8      9     10   11   12
    p(i) in %   0.920  6.624   …    …    …    …   4.435  0.877  0.102  0.006   0    0    0

    Definition   Two-tailed significant at the 1 % level
    D₁              8, 9, 10, 11, 12
    D₂           7, 8, 9, 10, 11, 12
    D₃              8, 9, 10, 11, 12














If we consider what is the objection to the D₂ region in this last respect, we seem to reach
the following broad conclusions. The purpose of a two-tailed test is to pick out departures
from the null hypothesis in either direction, i.e. to establish significance when there is
either a marked positive or negative association between X and Y. Clearly it can only act
in this way if the rejection region contains terms from both tails of the distribution. In the
case of extreme skewness where the first (or last) term has a probability that is greater than
100α %, ‘two-tailedness’ is obviously impossible, but in cases like those of Examples 2
and 3 it is possible to have a two-tailed 5 % (Ex. 2) and 1 % (Ex. 3) rejection region. What
is less clear is how to define that region in general terms.

It is possible that a more fundamental attack on the problem could be based on a study
of the power function of the test.* We shall, however, be content here to put forward a
definition giving a region which must include that resulting from the customary D₁ but
which, as with D₂, will often contain more points. If we start from the two-tailed region
of D₁ there will often be only one choice of a point to be included in the wider region; this
is the case for Example 1, where the 5 % rejection region under D₁ (including i = 6, 7)
can only be extended by including i = 0. In the case of Example 2, however, the 5 % region
of D₁ (6, 7) can be extended either by including i = 0 or 5 but not, of course, both.

A number of alternative definitions were considered; some of these were ambiguous or
could lead to the inconsistency referred to under (iii) on p. 496 above. Finally, we have
taken a definition D₃, which appears generally, though not quite always, to satisfy intuitive
requirements for a two-tailed test.

Definition D₃. Define F(E), or the ‘first tail’ probability, as the cumulative sum of the
probabilities under the null hypothesis of all possible events which are more extreme, in the
same direction, than a given event E, including the probability of E itself. Define S(E),
or the ‘second tail’ probability, as the cumulative sum of probabilities, starting with that
of the event most extreme in the opposite direction as compared with E, and cumulating up
to but not exceeding the value of F(E). If, and only if, F(E) + S(E) ≤ α, include E in the
rejection region for the two-tailed 100α % level of significance.

* [In this connexion it is perhaps relevant to note that when considering, on the basis of the power
function, an ‘unbiased’ two-tailed test in the case of continuous asymmetrical variation, Neyman
& Pearson (Statist. Res. Mem. 2 (1936), 18-25) found that a larger probability (i.e. > ½α) should be cut
off from the steeper tail of the null hypothesis distribution than from the finer tail. Ed.]

The rejection regions defined by D₃ are shown below the table of probabilities for each
of the Examples 1-3.

It can be readily seen that every event which is significant under D₁ will be significant
under D₃, but the regions under D₂ and D₃ will not necessarily correspond. This last point
is illustrated in Example 2, where D₃ may be thought to meet the requirements for a
two-tailed test better than D₂. On the other hand, in the case of Example 3, where a two-tailed
1 % rejection region is required, D₃ gives a region no wider than D₁, failing to include the
point i = 7. It is interesting to note that in this case there is a genuinely two-tailed region
which could be used, namely, that including i = 0, 9, 10, 11, 12, giving a total probability
of 0.926 < 1.000 %. From the point of view of all-round power of discrimination it is possible
that this last region might have some claim for consideration.
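The regions quoted for Examples 1-3 can be checked by direct computation. The following is only a sketch under stated assumptions: the function names are hypothetical; D₁ is read as the customary rule (first-tail probability not exceeding ½α); D₂ is read as the ordinate rule (the event, together with every event whose ordinate is no larger, has total probability not exceeding α); the direction of an event is decided by comparing i with its expectation BS/N; and 'significant' is taken to mean '≤ α'.

```python
from fractions import Fraction
from math import comb


def pmf(B, S, N):
    """Exact null distribution of the cell i (row total B, column total S, N observations)."""
    total = comb(N, B)
    return [Fraction(comb(S, i) * comb(N - S, B - i), total) for i in range(B + 1)]


def first_tail(p, i, upper):
    """F(E): probability of i plus everything more extreme in the same direction."""
    return sum(p[i:]) if upper else sum(p[: i + 1])


def second_tail(p, i, upper, F):
    """S(E): cumulate from the opposite extreme, stopping before the sum would exceed F."""
    order = range(0, i) if upper else range(len(p) - 1, i, -1)
    total = Fraction(0)
    for j in order:
        if total + p[j] > F:
            break
        total += p[j]
    return total


def regions(B, S, N, alpha):
    """Two-tailed rejection regions under D1, D2, D3 as read in the lead-in (assumptions)."""
    p = pmf(B, S, N)
    d1, d2, d3 = [], [], []
    for i in range(B + 1):
        if i * N == B * S:      # exactly at expectation: no direction, never significant
            continue
        upper = i * N > B * S
        F = first_tail(p, i, upper)
        if F <= alpha / 2:
            d1.append(i)
        if sum(q for q in p if q <= p[i]) <= alpha:
            d2.append(i)
        if F + second_tail(p, i, upper, F) <= alpha:
            d3.append(i)
    return d1, d2, d3


if __name__ == "__main__":
    cases = [
        ("Example 1, 5 %", (7, 15, 40), Fraction(5, 100)),
        ("Example 2, 5 %", (7, 14, 40), Fraction(5, 100)),
        ("Example 2, 7.2 %", (7, 14, 40), Fraction(72, 1000)),
        ("Example 3, 1 %", (12, 12, 43), Fraction(1, 100)),
    ]
    for name, margins, alpha in cases:
        print(name, regions(*margins, alpha))
```

Exact rational arithmetic is used so that borderline comparisons, such as p(0) = 3.53 % against ½ × 7.2 % in Example 2, do not depend on rounding.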

To illustrate anomalies and differences between definitions, we have naturally picked
out exceptional cases, but it should be emphasized that these cases are very rare. Thus,
among all the 2 × 2 tables with N ≤ 50 considered in building up the tables of significance
points given below, there were only thirty-two cases in which the regions defined by D₂
and D₃ differed (one case of disagreement usually results in two changes in the tables).

In forming the tables with significance levels of 5 and 1 %, the computation was done
independently three times, once on the basis of D₂ and later on the basis of D₃ and of another
alternative definition, not given here, which was finally discarded for possible inconsistency.
The tables contain the results according to D₃. The cases where the result of using D₂ is
different are indicated by + or − signs above or below the corresponding entry; thus a
+ indicates that the entry is to be increased by 1, a − that it is to be decreased by 1, to
obtain the limit under D₂.

In the following section we give directions for use of the tables with an example, while 
an Appendix contains the mathematical description of the method by which the tables 
with this special form of entry have been built up. 


3. DIRECTIONS FOR USE OF THE TABLES 


N ≤ 50 is the number of observations.
The tables are sometimes applicable in cases N > 50. See Note 2.
No 2 × 2 table where N < 6 is significant one-tailed at the 5 % level.


Step 1. Arrange the given 2 × 2 table in the form

    a    c  |  A
    b    d  |  B
    R    S  |  N

where A is the largest and B is the smallest (or in case of equality one of the smallest) of
the four marginal totals, A, B, R, S, and aB > Ab.
If aB = Ab, the table is not significant.














Step 2. Note the four figures:

d for finding the appropriate table,
b = x and c − b = S − B = y for finding the place in the table,
a − d = R − B = A − S ≥ 0 for decision.

Step 3. Look at the table at the end of this paper headed i = d and find the intersection
of arrays x = b and y = c − b. There are four entries

    z₁  z₂
    z₃  z₄        (exceptions see Notes 1, 2 and 3)

for the one-tailed and two-tailed points of significance according to the scheme:

            One-tailed   Two-tailed
    5 %        z₁           z₂
    1 %        z₃           z₄

Step 4. Compare the figure a − d with the entry zⱼ in which you are interested
(j = 1, 2, 3, 4), e.g. the given 2 × 2 table is two-tailed significant at the 5 % level if and only
if a − d ≥ z₂. Correspondingly for z₁, z₃ and z₄.
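Steps 1, 2 and 4 are mechanical and can be sketched in code. This is an illustration only: the function names and the dictionary keys are hypothetical, the raw table is assumed to be given as two rows of counts, and the entries zⱼ still have to be read from the printed tables by hand.

```python
def arrange(table):
    """Bring ((p, q), (r, s)) into the standard form ((a, c), (b, d)) of Step 1.

    Returns None when aB = Ab (the table is then not significant).
    """
    (p, q), (r, s) = table
    if min(p + r, q + s) < min(p + q, r + s):
        # The smallest marginal total is a column total: transpose the table.
        (p, q), (r, s) = (p, r), (q, s)
    if p + q < r + s:
        # Put the row with the smaller total (B) at the bottom.
        (p, q), (r, s) = (r, s), (p, q)
    a, c, b, d = p, q, r, s
    A, B = a + c, b + d
    if a * B == A * b:
        return None
    if a * B < A * b:
        # Interchange the columns so that aB > Ab.
        a, c, b, d = c, a, d, b
    return (a, c), (b, d)


def lookup_keys(table):
    """Step 2: the figures needed to locate the entries and to decide."""
    arranged = arrange(table)
    if arranged is None:
        return None
    (a, c), (b, d) = arranged
    return {"i": d, "x": b, "y": c - b, "a_minus_d": a - d}


def is_significant(a_minus_d, z):
    """Step 4: z is an entry read from the table, or None for the entry '-'."""
    return z is not None and a_minus_d >= z


if __name__ == "__main__":
    # The 2 x 2 table used in the worked example of this section.
    print(lookup_keys(((4, 18), (12, 9))))
```

The transposition test relies on A = N − B: once B is the smallest marginal total, A is automatically the largest.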

Note 1. If no entry is given, the answer is:

(a) Not significant for N ≤ 50 if the missing entry should be to the right or below any
given entry in the table headed i = d, or somewhere to the right of the whole table: e.g.
i = 11, x = 5, y = 14, a − d = 2. The missing entry should be to the right of the given entry
z₁ = 8, i = 11, x = 3, y = 14. Therefore the event

    13   19  |  32
     5   11  |  16
    18   30  |  48

is not significant at the 5 % level.

(b) Significant two-tailed at the 1 % level if the missing entry should be to the left or
above any given entry in the table headed i = d or somewhere to the left of the whole table;
e.g. i = 16, x = 2, y = 13, a − d = 1. Because x < 3, the missing entry should be to the left
of the whole table for i = 16. Therefore the event

    17   15  |  32
     2   16  |  18
    19   31  |  50

is significant at the top level.

(c) The entry − instead of a numerical entry stands for ‘not significant for N ≤ 50’.

Note 2. These tables are sometimes applicable in cases where B + S ≤ 50 < N, i.e. in cases
where an entry is given or missing to the left. These tables are not applicable if 50 < B + S.
In most practical cases (assuming N > 50) where these tables fail to give an answer, the
Table VIII given by Fisher & Yates (1949) gives an answer and vice versa in all cases for
N ≤ 50.

Note 3. In cases where only two entries (counting ‘−’ as an entry) are given instead of
four, each entry stands for both the one-tailed and two-tailed point of significance.











Example for application of the tables

Given

     4   18  |  22
    12    9  |  21
    16   27  |  43

Step 1

    18    9  |  27
     4   12  |  16
    22   21  |  43

B = 16 < 21 < 22 < 27,
A = 27 > 22 > 21 > 16,
a = 18, b = 4,
Ba = 16 × 18 = 288 > 108 = 27 × 4 = Ab;
consequently the arrangement is correct.

The only possible other arrangement after having chosen B as the smallest marginal total is

     9   18  |  27
    12    4  |  16
    21   22  |  43

But now a = 9, b = 12,
Ba = 16 × 9 = 144 < 324 = 27 × 12 = Ab,
contrary to the agreed kind of arrangement.

Step 2

d = 12 for finding the appropriate table,
b = x = 4 and c − b = y = 5 for finding the place in the table,
a − d = 6 for comparison with zⱼ.

Step 3. In the table headed i = 12 at the intersection of x = 4 and y = 5, the entries are

    z₁ = 1   z₂ = 2
    z₃ = 6   z₄ = 8

Step 4. Result: the given 2 × 2 table is
significant (one-tailed) at the 5 % level (6 ≥ 1),
significant (two-tailed) at the 5 % level (6 ≥ 2),
significant (one-tailed) at the 1 % level (6 = 6),
not significant (two-tailed) at the 1 % level (6 < 8).


The entry z₂ = 2 shows that even the less significant table with the same d, B, S (b and c),
and with a − d = 2 or

    14    9  |  23
     4   12  |  16
    18   21  |  39

is significant (two-tailed) at the 5 % level. But z₁ = 1 shows that the table with a − d = 0 or

    12    9  |  21
     4   12  |  16
    16   21  |  37

is not significant (one-tailed) at the 5 % level.











z₄ = 8 shows that the first 2 × 2 table with the same d, B and S as the given table, which is
significant (two-tailed) at the 1 % level, is with a − d = 8 or

    20    9  |  29
     4   12  |  16
    24   21  |  45


Note 4. A user who wants to apply definition D₃ must neglect all + or − signs attached
to any entry in the table.

A user who wants to apply definition D₂ must read a + sign attached to an entry as:
add 1 to the given entry, and a − sign as: subtract 1 from the given entry. For example:

In the case i = 5, x = 1, y = 12, by z₂ = 21 with a − sign is meant:

    in case of D₃   z₂ = 21
    in case of D₂   z₂ = 21 − 1 = 20

Correspondingly, in the case i = 9, x = 0, y = 21, by z₄ = 9 with a + sign is meant:

    in case of D₃   z₄ = 9
    in case of D₂   z₄ = 9 + 1 = 10


APPENDIX 
Mathematical description of the tables 


The tables are completely based on the well-known exact formula:

\[
\frac{A!\,B!\,R!\,S!}{N!\,a!\,b!\,c!\,d!}
  = \frac{\binom{S}{d}\binom{N-S}{B-d}}{\binom{N}{B}}
  = p(d \mid BSN),
\]

where p(d | BSN) is the (single) probability of obtaining d with given marginal totals B and S and given
number of observations N, assuming that both classifications are independent of each other.

It follows directly from the arrangement of the table [B = Min(A, B, R, S) and Ba > Ab] that Nd > BS
and d ≤ B ≤ S ≤ N − B, in particular 2B ≤ N.

For given d, B, S and N the probability p(i | BSN) (i = 0, 1, 2, ..., B) is a step function of i only. This
function, call it p(i), has the following properties:

It rises monotonically from i = 0 to a maximum which is at i_max = [m], where

(N + 2)m = (B + 1)(S + 1),

and falls monotonically until i = B.

There are some special cases:

(1) i_max may be 0 or B, in which cases p(i) is monotonically decreasing or increasing in its whole
range.

(2) There are two equal values p(i_max) = p(i_max − 1) in the cases where m is an integer.

Proof.

p(i) : p(i + 1) = n : D, where n = (i + 1)(N − S − B + i + 1) and D = (B − i)(S − i).

Thus p(i) < p(i + 1) if v(i) < 0,
     p(i) > p(i + 1) if v(i) > 0, where v(i)(N + 2) = n − D, v(i) = i + 1 − m.

Therefore, if i ≥ i_max, v(i) ≥ [m] + 1 − m > 0,
           if i < i_max, v(i) ≤ [m] − m ≤ 0,

which proves the above statements.
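The statement about the maximum can be checked by brute force. The sketch below is an illustration, not part of the original computation; the function names are hypothetical. It enumerates all margins with B ≤ S ≤ N − B up to a chosen N and confirms that p(i) peaks at i_max = [m], with a tie between i_max and i_max − 1 exactly when m is an integer.

```python
from fractions import Fraction
from math import comb


def pmf(B, S, N):
    """Exact probabilities p(i | BSN) for i = 0, ..., B."""
    total = comb(N, B)
    return [Fraction(comb(S, i) * comb(N - S, B - i), total) for i in range(B + 1)]


def check_mode(max_n=30):
    """Verify the position of the maximum for every B <= S <= N - B with N <= max_n."""
    checked = 0
    for N in range(2, max_n + 1):
        for B in range(1, N // 2 + 1):
            for S in range(B, N - B + 1):
                p = pmf(B, S, N)
                m = Fraction((B + 1) * (S + 1), N + 2)
                i_max = m.numerator // m.denominator       # integer part [m]
                assert p[i_max] == max(p)
                # Non-decreasing up to i_max, strictly decreasing afterwards.
                assert all(p[i] <= p[i + 1] for i in range(i_max))
                assert all(p[i] > p[i + 1] for i in range(i_max, B))
                if m.denominator == 1:
                    # m is an integer: two equal maximal ordinates.
                    assert p[i_max] == p[i_max - 1]
                checked += 1
    return checked


if __name__ == "__main__":
    print(check_mode(30), "margin combinations checked")
```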













It may be shown that BS mr, ax B 1 
DN Stmax, S WN +1. 


Definitions. f = f(d | BSN) or ‘first tail’:

\[ f(d \mid BSN) = \sum_{i=d}^{B} p(i \mid BSN); \]

s = s(d | BSN) or ‘second tail’:

\[ s(d \mid BSN) = \sum_{i=0}^{k} p(i \mid BSN), \quad\text{where } k < d, \]

p(k | BSN) ≤ p(d | BSN), p(k + 1 | BSN) > p(d | BSN) (k + 1 = d in the case d = i_max).

If no k exists (that is, if p(0 | BSN) > p(d | BSN)), then s(d | BSN) = 0.
[k is (given B and S) a function k(d, N) and Nk < BS.]
t = t(d | BSN) or ‘two-tailed probability’,
t = f + s.
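These definitions translate directly into code. The following is a minimal sketch with hypothetical function names; it assumes that d lies in the upper tail (Nd > BS), as guaranteed by the arrangement of the table, and it takes the second tail exactly as defined above, that is, by comparing ordinates with p(d).

```python
from fractions import Fraction
from math import comb


def p(i, B, S, N):
    """Single probability p(i | BSN)."""
    return Fraction(comb(S, i) * comb(N - S, B - i), comb(N, B))


def first_tail(d, B, S, N):
    """f(d | BSN): sum of p(i) for i = d, ..., B."""
    return sum(p(i, B, S, N) for i in range(d, B + 1))


def second_tail(d, B, S, N):
    """s(d | BSN): sum of p(i) for i = 0, ..., k, where p(k) <= p(d) < p(k + 1)."""
    pd = p(d, B, S, N)
    total = Fraction(0)
    for i in range(d):
        pi = p(i, B, S, N)
        if pi > pd:          # first ordinate above p(d): k = i - 1, stop here
            break
        total += pi
    return total


def two_tailed(d, B, S, N):
    """t = f + s."""
    return first_tail(d, B, S, N) + second_tail(d, B, S, N)


if __name__ == "__main__":
    # Margins of the worked example in Section 3: d = 12, B = 16, S = 21, N = 43.
    print(float(first_tail(12, 16, 21, 43)), float(two_tailed(12, 16, 21, 43)))
```

For the worked example of Section 3 the one-tailed probability falls just below 1 % and the two-tailed probability lies between 1 % and 5 %, in agreement with the entries z₁ to z₄ read there.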
From these definitions follows: 


Statement 1. p(i | BSN) > p(i | BS, N + 1) if i ≥ d.
Statement 2. f(d | BSN) > f(d | BS, N + 1) exactly.

Statement 3. p(i | BSN) < p(i | BS, N + 1) if i ≤ k(d, N).

Statement 4. t(d | BSN) > t(d | BS, N + 1) not exactly, but with sufficient approximation in practice.


The construction of the tables is based on Statements 2 and 4. 

For the one-tailed case, it would be enough to compute f(d | BSN) for fixed d, B, S and increasing N
until the table is found to be significant. Then Statement 2 enables the tables to be condensed to the
given form. For the two-tailed case, Statement 4 could serve the same purpose if it were exact. The
inexactness of Statement 4 was the reason for computing all possible cases until N = 50. The
computation was done by direct evaluation of the exact formula for t(d | BSN).

Thus the entries z₁ and z₃ give answers based on mathematically exact general reasons. The entries
z₂ and z₄ are (although of course correct in the tables) based only on empirical computation. An exact
proof of Statement 4 is impossible, as one can show examples where it is not true.
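The construction just described can be imitated by scanning N upwards for fixed d, B and S and recording the first value of a − d = N − B − S at which the table becomes significant. The sketch below is an illustration under assumptions: all names are hypothetical, 'significant' is taken as '≤ α', the ordinate rule is the second tail of this Appendix, and the cumulative rule is a reading of definition D₃ of Section 2. Because Statement 4 is not exact, the first crossing found for the two-tailed case is not guaranteed in general to be the only one.

```python
from fractions import Fraction
from math import comb


def p(i, B, S, N):
    return Fraction(comb(S, i) * comb(N - S, B - i), comb(N, B))


def first_tail(d, B, S, N):
    return sum(p(i, B, S, N) for i in range(d, B + 1))


def second_tail_ordinate(d, B, S, N, f):
    """Second tail of the Appendix: lower ordinates not exceeding p(d)."""
    pd, total = p(d, B, S, N), Fraction(0)
    for i in range(d):
        pi = p(i, B, S, N)
        if pi > pd:
            break
        total += pi
    return total


def second_tail_cumulative(d, B, S, N, f):
    """Second tail read from definition D3: cumulate without exceeding f."""
    total = Fraction(0)
    for i in range(d):
        pi = p(i, B, S, N)
        if total + pi > f:
            break
        total += pi
    return total


def threshold(d, B, S, alpha, second_tail=None, n_max=50):
    """Smallest a - d with N <= n_max at which the table is significant, else None.

    second_tail=None gives the one-tailed point of significance.
    """
    for N in range(B + S, n_max + 1):
        f = first_tail(d, B, S, N)
        s = second_tail(d, B, S, N, f) if second_tail else 0
        if f + s <= alpha:
            return N - B - S
    return None


if __name__ == "__main__":
    five, one = Fraction(5, 100), Fraction(1, 100)
    # Margins of the worked example: d = 12, B = 16, S = 21.
    print(threshold(12, 16, 21, five), threshold(12, 16, 21, five, second_tail_ordinate),
          threshold(12, 16, 21, one), threshold(12, 16, 21, one, second_tail_ordinate))
```

Under these assumptions the scan gives 1, 2, 6 and 8 for the worked example of Section 3, and for the two cases quoted in Note 4 it gives 21 and 9 with the cumulative rule against 20 and 10 with the ordinate rule.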


Proofs for Statements 1, 2 and 3 


(1) (h + BS) · p(i | BSN) = [h + i(N + 1)] · p(i | BS, N + 1), where h = (N + 1)(N + 1 − B − S). Statement
1 follows on account of i(N + 1) > iN ≥ dN > BS.

(2) Follows directly from 1.

(3)
\[
\frac{BS}{N} - \frac{BS}{N+1} = \frac{BS}{N(N+1)} \le \frac{B(N-B)}{N(N+1)} \le \frac{N}{4(N+1)} < \frac{1}{4}.
\]

Corresponding to 1, one finds:

*p(i, N) > p(i, N + 1) as long as (N + 1)i > BS,
and for (N + 1)i < BS, p(i, N) < p(i, N + 1).

But for the single possible value i with

\[ \frac{BS}{N+1} < i < \frac{BS}{N} \]

follows i = i_max, because

\[ m = \frac{BS}{N+1} + \frac{B+S+1}{N+2} - \frac{BS}{(N+1)(N+2)}, \]

and this, together with i ≤ k(d, N), contradicts the definition of k which proves Statement 3.
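The identity used in proof (1) can be verified in one line from the exact formula; the following derivation is added as a check and uses only the symbols already defined (h = (N + 1)(N + 1 − B − S)):

```latex
\[
\frac{p(i \mid BSN)}{p(i \mid BS, N+1)}
  = \frac{\binom{N-S}{B-i}}{\binom{N+1-S}{B-i}} \cdot \frac{\binom{N+1}{B}}{\binom{N}{B}}
  = \frac{N+1-S-B+i}{N+1-S} \cdot \frac{N+1}{N+1-B}
  = \frac{h + i(N+1)}{h + BS},
\]
% numerator:   (N+1)(N+1-S-B+i) = h + i(N+1)
% denominator: (N+1-S)(N+1-B)   = (N+1)^2 - (N+1)(B+S) + BS = h + BS
```

Hence p(i | BSN) exceeds p(i | BS, N + 1) exactly when i(N + 1) > BS, which is the comparison on which Statements 1 and 3 rest.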





Conclusions from Statement 3 

(1) k(d, N) is a monotonically decreasing step function of N, or k(d, N) ≥ k(d, N + 1).

(2) As long as p[k(d, N) | BS, N + 1] ≤ p(d | BS, N + 1), it follows that s(d | BSN) < s(d | BS, N + 1),
but in the cases where p[k(d, N) | BS, N + 1] > p(d | BS, N + 1), there is one member less in s(N + 1)
than in s(N), and thus s(d | BSN) > s(d | BS, N + 1).

(It is easy to prove that p(i | BSN) > p(i − 1 | BS, N + 1) if i ≤ k(d, N).)


* By p(i, N) is meant p(i | BSN).











Thus the second tail as a function of N increases monotonically step by step in intervals of N, where
k(d, N + 1) = k(d, N),
and decreases in big steps from N to N + 1, where
k(d, N + 1) < k(d, N)
(or where the length of the second tail decreases).
Of course, lim s(d | BSN) = 0 as N → ∞, because lim i_max = 0 as N → ∞.


Question of Statement 4 
lim t(d | BSN) = 0 as N → ∞, because t = s + f, lim s = 0, and lim f = 0, the last equation following from:

\[
\lim_{N\to\infty} i_{\max} = 0 \quad\text{and}\quad
\lim_{N\to\infty} p(0, N) = \lim_{N\to\infty} \frac{\binom{N-S}{B}}{\binom{N}{B}} = 1.
\]

The question now is: Is this decreasing of t(d | BSN) with increasing N monotonic like that of
f(d | BSN) or not? Unfortunately it is not.
Proof. Computation gives, e.g. for

d = 9, B = 9, S = 36, N = 48,
t(9 | 9, 36, 48) < 0.0883,
t(9 | 9, 36, 49) > 0.0892.

Therefore, Statement 4 is certainly not exact.*
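The counter-example is easy to reproduce. The sketch below (hypothetical function names; the second tail is taken by ordinates as in the Definitions of this Appendix) evaluates both probabilities exactly and shows that the first tail still decreases from N = 48 to N = 49 while the two-tailed probability increases.

```python
from fractions import Fraction
from math import comb


def p(i, B, S, N):
    return Fraction(comb(S, i) * comb(N - S, B - i), comb(N, B))


def first_tail(d, B, S, N):
    return sum(p(i, B, S, N) for i in range(d, B + 1))


def second_tail(d, B, S, N):
    """Lower-tail ordinates not exceeding p(d), summed from i = 0 upwards."""
    pd, total = p(d, B, S, N), Fraction(0)
    for i in range(d):
        pi = p(i, B, S, N)
        if pi > pd:
            break
        total += pi
    return total


def two_tailed(d, B, S, N):
    return first_tail(d, B, S, N) + second_tail(d, B, S, N)


if __name__ == "__main__":
    for N in (48, 49):
        print(N, float(first_tail(9, 9, 36, N)), float(two_tailed(9, 9, 36, N)))
```

The increase arises because the ordinates in the second tail grow with N while k stays at 4, which is the mechanism described in the Conclusions from Statement 3.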
The size of t(N) and t(N + 1) is influenced mainly by the sums
T_N = p(d | BSN) + p(k | BSN) and T_{N+1} = p(d | BS, N + 1) + p(k | BS, N + 1),
where both times k = k(d, N).†
The second term in T_{N+1} is included in t(N + 1) only under the condition
p(d, N + 1) ≥ p(k, N + 1).
Thus only these cases must be considered.


We have

\[ T_N = \frac{c_N + c'_N}{D_N}, \]

where

\[ c_N = \binom{S}{d}\binom{N-S}{B-d}, \qquad c'_N = \binom{S}{k}\binom{N-S}{B-k}, \qquad D_N = \binom{N}{B}; \]

k = k(d, N) in c'_{N+1} also.

It is easy to show by simple computing that

\[
T_N D_{N+1}\left(1 - \frac{B}{N+1}\right)
  = c_{N+1}\left(1 - \frac{B-d}{N+1-S}\right) + c'_{N+1}\left(1 - \frac{B-k}{N+1-S}\right)
\]

and the ratio T_N / T_{N+1} will be smallest (that is, the ‘bad’ case for Statement 4) if

c_{N+1} = c'_{N+1}, or T_{N+1} D_{N+1} = 2c_{N+1},

and in this worst case one has

\[
T_N\left(1 - \frac{B}{N+1}\right) = T_{N+1}\left[1 - \frac{2B - (d+k)}{2(N+1-S)}\right].
\]

Thus, in the worst case T_N < T_{N+1} is equivalent to

\[ \frac{B}{N+1} < \frac{2B - (d+k)}{2(N+1-S)} \quad\text{or}\quad 2BS > (N+1)(d+k). \]

* This difficulty does not arise if the definition D₂ is replaced by D₁, because then Statement 4 is
exact. But in case of D₃, this example applies and D₃ does not improve the situation as compared
with D₂.

† If k does not exist, then s = 0 and Statement 4 is exact.





Now both the members in this ‘bad’ inequality are usually of roughly the same size, because:
if 2S < N + 1 then d + k < B,
if 2S > N then d + k > B,
and in the case 2S = N + 1 both members are equal and therefore
t(d | B, S, 2S) = t(d | B, S, 2S − 1),
and only in the rare cases where the three conditions:
(1) c'_{N+1} ≈ c_{N+1},
(2) c_{N+1} ≥ c'_{N+1},
(3) 2BS > (d + k)(N + 1),
are all given is it possible to have T_{N+1} > T_N,
and therefore it is only in these cases to be expected that t(N + 1) > t(N).

Empirically, it was found that, for N ≤ 50 and the levels of 5 and 1 % significance, no inconsistency
is introduced to the tables by using Statement 4, in spite of its inexactness.


The author wishes to thank Prof. E. S. Pearson, to whom he owes a redrafting of §§ 1 and 2,
Mr A. G. Arbous and Mr J. E. Kerrich, for their critical remarks and helpful suggestions,
and also the members of the Mathematical Statistics Department of the National
Institute for Personnel Research, for their assistance in computational work and editing of
this paper.


REFERENCES 


BARNARD, G. A. (1947). Significance tests for 2 × 2 tables. Biometrika, 34, 123-38.

COCHRAN, W. G. (1952). χ² test of goodness of fit. Ann. Math. Statist. 23, 315-45.

FINNEY, D. J. (1948). Tests of significance in a 2 × 2 contingency table. Biometrika, 35, 148.

FISHER, R. A. (1941). Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd.

FISHER, R. A. & YATES, F. (1949). Statistical Tables for Biological, Agricultural and Medical Research.
Edinburgh: Oliver and Boyd.

LATSCHA, R. (1953). Tests of significance in a 2 × 2 contingency table: extension of Finney’s tables.
Biometrika, 40, 74.

MAINLAND, D. (1948). Statistical methods in medical research. Tables V and VI. Canad. J. Res. E,
26, 1-116.

TOCHER, K. D. (1950). Extension of the Neyman-Pearson theory of tests to discontinuous variates.
Biometrika, 37, 130-44.











[Tables headed i = 2, i = 3, i = 4 and following: the numerical entries are not recoverable from this copy.]

ARRANGEMENT OF THE ENTRIES

            One-tailed   Two-tailed
    5 %        z₁           z₂
    1 %        z₃           z₄

a − d ≥ zⱼ: significant
a − d < zⱼ: not significant

D₃: neglect all + or − signs
D₂: read + as ‘add 1’ and − as ‘subtract 1’

















[Tables headed i = 5, i = 6 and following: the numerical entries are not recoverable from this copy.]


























































































































































































































[Tables headed i = 7, i = 8 and i = 9: the numerical entries are not recoverable from this copy.]



































































































































































































































[Further table page: headings and numerical entries are not recoverable from this copy.]












































































































































































































































[Tables for three values of i, the last two headed i = 12 and i = 13: the numerical entries are not recoverable from this copy.]











[Tables headed i = 13, i = 14, i = 15, i = 16, i = 17 and i = 18: the numerical entries are not recoverable from this copy.]














Note: If no entry is given, the answer is:
(a) Not significant for N ≤ 50 if the missing entry should be to the right or below any given entry
in the table headed i = d, or somewhere to the right of that table.

(b) Significant two-tailed at the 1 % level if the missing entry should be to the left or above any
given entry in the table headed i = d, or somewhere to the left of that table. For i > 18 all 2 × 2
tables which are possible with N ≤ 50 are two-tailed significant at the 1 % level.










MISCELLANEA 


A note on moving ranges 


By H. A. DAVID 


Commonwealth Scientific and Industrial Research Organization, Sydney, 
and Department of Statistics, University of Melbourne 


1. INTRODUCTION 


Consider a production process in which it takes some time to generate a single item for measurement. 
Under these conditions moving averages and the corresponding moving ranges provide simple current 
measures of location and dispersion. Moreover, with the customary assumption of an underlying normal 
distribution, moving average and range charts may be constructed and used in a very similar fashion 
to ordinary mean and range charts (see, for example, Grant, 1946; Duncan, 1952). One point of difference 
is that for the preliminary estimation of the population standard deviation it is convenient to use the 
set of moving ranges, in samples of size n, that become available as the charts are put in operation. To 
obtain this estimate of σ the average of the set, the mean moving range w̄(n), need merely be multiplied
by the well-known scale factor converting a range into an unbiased estimate of σ. Clearly w̄(2) is the
mean successive difference. w̄(n) will, in general, increase with n in its sensitivity to any trends in the
data; but in the absence of such trends w̄(5), for example, will be shown to be considerably more efficient
than w̄(2).
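As an illustration of this estimate, the sketch below computes the moving ranges, their mean, and the resulting estimate of σ. The function names and the data are hypothetical. The scale factor is taken as 1/d₂, where d₂ is the expected range of n standard normal observations; the values used are the usual control-chart constants rounded to three decimals, and they presuppose a normal parent population.

```python
# Expected range d2 of n independent standard normal observations (rounded).
D2 = {2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326}


def moving_ranges(x, n):
    """Ranges of the successive moving samples (x[i], ..., x[i + n - 1])."""
    return [max(x[i:i + n]) - min(x[i:i + n]) for i in range(len(x) - n + 1)]


def mean_moving_range(x, n):
    w = moving_ranges(x, n)
    return sum(w) / len(w)


def sigma_estimate(x, n):
    """Mean moving range multiplied by the scale factor 1 / d2(n)."""
    return mean_moving_range(x, n) / D2[n]


if __name__ == "__main__":
    data = [1, 3, 2, 5, 4]                 # made-up measurements in time order
    print(moving_ranges(data, 2))          # successive differences in absolute value
    print(sigma_estimate(data, 2))
```

With n = 2 the moving ranges are the absolute successive differences, so the estimate reduces to the mean successive difference divided by 1.128.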

Corresponding to moving ranges we may speak of moving maxima and minima which analogously 
to charts of ordinary maxima and minima (Howell, 1949) are capable of application to quality control. 

This note deals with some properties of moving ranges, maxima and minima. In particular, an 
expression is obtained for the correlation between two ranges in overlapping samples in terms of means, 
variances and covariances of order statistics. For a normal parent population this result is applied to the 
determination of the efficiency of $\bar{w}_1(n)$. The distribution of runs of equal ranges, which tend to occur in
moving range charts, is also briefly discussed. It is realized that these devices have not been used widely 
and this study is offered in the hope that a clearer understanding of the methods and properties may 
facilitate the decision as to whether their wider application is worth while. 


2. MOVING MAXIMA
Suppose that $x_{(1)}, x_{(2)}, \ldots, x_{(i)}, \ldots, x_{(N)}$ represent observations arranged in order of time. We may call
$(x_{(i)}, \ldots, x_{(i+n-1)})$ the $i$th moving sample of $n$, and shall denote the corresponding maximum, minimum and
range by $m_i(n)$, $m'_i(n)$ and $w_i(n)$ respectively.
The order statistics in a sample of $n$ will be written, following Godwin (1949), as

$$x(1\,|\,n) \geq x(2\,|\,n) \geq \ldots \geq x(n\,|\,n),$$

or more briefly as $x_1 \geq x_2 \geq \ldots \geq x_n$.
To obtain the joint probability

$$P = \Pr[m_i(n) < X,\ m_{i+d}(n) < Y] \qquad (0 < d < n)$$

it is only necessary to consider the $(n+d)$ values from which $m_i$, $m_{i+d}$ are calculated. We find

$$P = F^n(X)\,F^d(Y) \quad (X < Y), \qquad P = F^d(X)\,F^n(Y) \quad (X \geq Y),$$

where $F(X) = \Pr(x < X)$.

However, we proceed to determine the correlation between $m_i$ and $m_{i+d}$ by a less direct method. The
$x$'s will henceforth be assumed continuous.










Consider first two successive maxima and the set $x_{(1)}, x_{(2)}, \ldots, x_{(n)}, x_{(n+1)}$ (say) on which they are based.
Then $m_1(n) = m_2(n) = x(1\,|\,n+1)$ unless $x_{(1)}$ or $x_{(n+1)}$ is the largest value in the set. In the latter case,
which has probability $2/(n+1)$, $m_1$, $m_2$ are $x(1\,|\,n+1)$, $x(2\,|\,n+1)$. It follows that

$$E[m_1(n)\,m_2(n)] = \frac{n-1}{n+1}E[x^2(1\,|\,n+1)] + \frac{2}{n+1}E[x(1\,|\,n+1)\,x(2\,|\,n+1)].$$

Writing $x_i$ for $x(i\,|\,n+d)$ we find likewise for the case $d = 2$

$$E[m_1(n)\,m_3(n)] = \frac{n-2}{n+2}E(x_1^2) + \frac{4n}{(n+1)(n+2)}E(x_1x_2) + \frac{4}{(n+1)(n+2)}E(x_1x_3),$$


and in general

$$E(m_i\,m_{i+d}) = \frac{n-d}{n+d}E(x_1^2) + 2n\sum_{t=0}^{d-1}\frac{d!\,(n+d-t-2)!}{(d-t-1)!\,(n+d)!}E(x_1x_{t+2}).$$


The required correlation is now given by

$$\rho(m_i, m_{i+d}) = [E(m_i\,m_{i+d}) - E^2(m_i)]/\operatorname{var}(m_i).$$

In the case of a normal parent population the first two moments and product-moments of order statistics
have been tabulated for $n \leq 10$ (Godwin, 1949), so that $\rho$ can be calculated for $n \leq 5$ (see Table 1).
Explicit expressions for $\rho$ can also be given for $n \leq 3$ from the exact results for order statistics up to $n = 6$
(Jones, 1948; Godwin, 1949).
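As a check on the simplest case, the correlation between two successive maxima in samples of $n = 2$ can be evaluated in closed form. The sketch below assumes the standard exact moments of normal order statistics for samples of two and three (they are quoted from general knowledge, not from this note), and the function name is illustrative.

```python
import math

# Assumed closed-form moments for N(0, 1) order statistics (standard results):
# x(1|3) is the largest and x(2|3) the middle value in a sample of three.
E_X1_SQ_N3 = 1.0 + math.sqrt(3.0) / (2.0 * math.pi)   # E[x(1|3)^2]
E_X1_X2_N3 = math.sqrt(3.0) / (2.0 * math.pi)         # E[x(1|3) x(2|3)]
E_MAX_N2 = 1.0 / math.sqrt(math.pi)                   # E[m_1(2)]
VAR_MAX_N2 = 1.0 - 1.0 / math.pi                      # var m_1(2)

def rho_successive_maxima_n2():
    """Correlation of m_1(2) and m_2(2) from the d = 1 product-moment formula."""
    n = 2
    e_prod = (n - 1) / (n + 1) * E_X1_SQ_N3 + 2 / (n + 1) * E_X1_X2_N3
    return (e_prod - E_MAX_N2 ** 2) / VAR_MAX_N2

if __name__ == "__main__":
    print(round(rho_successive_maxima_n2(), 4))
```

The value agrees with the entry for maxima with $n = 2$, $d = 1$ in Table 1.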


3. MOVING RANGES 
To obtain the correlation between two ranges in overlapping samples we note that

$$E(w_i\,w_{i+d}) = E[(m_i - m'_i)(m_{i+d} - m'_{i+d})].$$

For a symmetrical population this reduces to

$$E(w_i\,w_{i+d}) = 2E(m_i\,m_{i+d}) - 2E(m_i\,m'_{i+d}).$$

$E(m_i\,m'_{i+d})$ can again be expressed in terms of the expectations of products of order statistics in samples
of $(n+d)$. It will be convenient to write $x'_j$ in place of $x_{n+d-j+1}$ to denote the $j$th smallest value of $x$. Then

$$E(m_i\,m'_{i+d}) = \sum_{i,j=1}^{1+d} p_{ij}\,E(x_i\,x'_j),$$

where the coefficients $p_{ij}$ may be obtained as follows.
Split up the $(n+d)$ values, arranged in order of time, into three groups $[a]$, $[b]$ and $[c]$, comprising
respectively the first $d$, the central $(n-d)$ and the last $d$ of the $x_{(t)}$. Then $p_{ij}$ is the joint probability of

$$\max[a+b] = x_i \quad\text{and}\quad \min[b+c] = x'_j,$$

where, for example, $\max[a+b]$ stands for the largest $x_{(t)}$ in the first $(a+b)$. Now the joint probability
of the $(i-1)$ largest $x_t$ lying in $[c]$ and the $(j-1)$ smallest in $[a]$ is

$$\frac{(d!)^2\,(n+d-i-j+2)!}{(d-i+1)!\,(d-j+1)!\,(n+d)!}.$$

We next require $x_i$ to lie in $[a+b]$ and $x'_j$ in $[b+c]$. According as $x_i$ falls into $[a]$ or $[b]$ this joint probability
is respectively

$$\frac{(d-j+1)(n-i+1)}{(n+d-i-j+2)(n+d-i-j+1)}, \qquad \frac{(n-d)(n-i)}{(n+d-i-j+2)(n+d-i-j+1)}.$$

Hence we have

$$p_{ij} = \frac{(d!)^2\,(n+d-i-j)!}{(d-i+1)!\,(d-j+1)!\,(n+d)!}\,\{(d-j+1)(n-i+1) + (n-d)(n-i)\}.$$

The required correlation coefficients are now readily obtained and, for a normal parent, are given in
Table 1 up to $n = 5$.
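The coefficients $p_{ij}$ can be checked by brute force, since they depend only on the relative ranks of the $(n+d)$ observations. The following sketch is an illustration under that assumption: it enumerates every time-ordering of the ranks and compares the resulting frequencies with the closed form. The function names are hypothetical, and the closed form is applied only where $n + d - i - j \geq 0$, which always holds when $d \leq n - 2$.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def p_formula(n, d, i, j):
    """p_ij from the closed form in the text (needs n + d - i - j >= 0)."""
    lead = Fraction(factorial(d) ** 2 * factorial(n + d - i - j),
                    factorial(d - i + 1) * factorial(d - j + 1) * factorial(n + d))
    return lead * ((d - j + 1) * (n - i + 1) + (n - d) * (n - i))

def p_enumerated(n, d):
    """Exact p_ij obtained by listing every time-ordering of n + d distinct ranks."""
    total = n + d
    counts = {}
    for order in permutations(range(1, total + 1)):   # rank 1 = smallest value
        i = total - max(order[:n]) + 1    # max of the first sample is the i-th largest
        j = min(order[d:])                # min of the later sample is the j-th smallest
        counts[(i, j)] = counts.get((i, j), 0) + 1
    return {key: Fraction(c, factorial(total)) for key, c in counts.items()}

if __name__ == "__main__":
    n, d = 4, 2
    for (i, j), p in sorted(p_enumerated(n, d).items()):
        print(i, j, p, p_formula(n, d, i, j))
```

Exact rational arithmetic is used so that agreement between the two columns is an identity rather than a numerical coincidence.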
















Table 1. Correlations between maxima and ranges in overlapping
samples of n having (n - d) values in common

   n    d    Maxima    Ranges

   2    1    0.4264    0.2239

   3    1    0.5990    0.5395
        2    0.2743    0.2062

   4    1    0.6923    0.6651
        2    0.4309    0.3942
        3    0.2028    0.1732

   5    1    0.7506    0.7353
        2    0.5322    0.5097
        3    0.3374    0.3149
        4    0.1611    0.1456





4. THE EFFICIENCY OF MEAN MOVING RANGES


When moving ranges are employed in quality control it is convenient to use as the preliminary estimate
of $\sigma$, calculated from $N$ observations, the statistic

$$\bar{w}_1 = \sum_{i=1}^{N-n+1} w_i\Big/(N-n+1).$$

More generally, a mean moving range $\bar{w}_r$ may be constructed based on every $r$th sample.
The efficiency $E$ of $\bar{w}_r$ defined by

$$E = \operatorname{var} s'/\operatorname{var}\bar{w}_r,$$

where

$$s' = \frac{\Gamma\{\tfrac12(N-1)\}}{\Gamma(\tfrac12 N)}\Big[\tfrac12\sum_{i=1}^{N}(x_i-\bar{x})^2\Big]^{1/2},$$

is given in Table 2 in a number of cases with $n \leq 5$. For purposes of comparison we list $E$ also for the
ordinary mean range estimator $\bar{w}$ in samples of five and the most efficient (non-overlapping) weighted
mean range estimator $w^*$ (see Grubbs & Weaver, 1947).


Table 2. Efficiency of various mean range estimators based on samples of n from N observations

              \bar{w}_1                  \bar{w}_2   \bar{w}     w*
   N     n=2     n=3     n=4     n=5      n=4        n=5        -
   10   0.643   0.735   0.753   0.747    0.785      0.826     0.850
   20   0.623   0.730   0.767   0.772    0.772      0.773     0.799
   30   0.617   0.729   0.775   0.788    0.769      0.756     0.786
   40   0.614   0.729   0.778   0.796    0.768      0.748     0.778
   50   0.612   0.729   0.780   0.801    0.768      0.743     0.773
   60   0.611   0.729   0.782   0.804    0.767      0.740     0.769
   80   0.609   0.729   0.784   0.808    0.767      0.736     0.767
  100   0.608   0.729   0.785   0.811    0.766      0.734     0.762
  200   0.607   0.729   0.788   0.816    0.766      0.729     0.759
  inf   0.605   0.728   0.790   0.821    0.765      0.725     0.754


































It will be noted that $E$ is lowest for $\bar{w}_1$ with $n = 2$. This is, of course, the case of the mean successive
difference which is, however, less influenced by trends than the statistics based on larger $n$.

Compared with $\bar{w}$ and $w^*$ the moving-range statistics are seen to gain in efficiency as $N$ increases. Their relatively
low efficiency for small $N$ is presumably due to the uneven weighting of the observations which is
inherent in their construction.

For fixed $n$ all the statistics $\bar{w}_r$, $\bar{w}$ and $w^*$ tend to normality with increasing $N$. This is obvious for $\bar{w}$
and $w^*$ and has, in fact, been proved explicitly for $\bar{w}_1$ (Hoel, 1946).


5. RUNS OF EQUAL RANGES 


Two ranges in samples of $n$, with $(n-d)$ observations in common, will be equal if two of these $(n-d)$
are the extremes in the $(n+d)$ values involved. Thus for any continuous population the probability of
equality is

$$P_{d+1} = [(n-d)(n-d-1)]/[(n+d)(n+d-1)].$$

Clearly, $P_{d+1}$ may be interpreted as the probability of a run of $(d+1)$ or more equal ranges, so that the
distribution of runs $p(r)$ is determined. It is easily shown that $p(r)$ is J-shaped, flattening out with
increasing $n$.
The expected number of runs is given by

$$E(r) = \sum_{d=0}^{n-2} P_{d+1} = (3n-2) - 2(2n-1)[\psi(2n-1) - \psi(n)],$$

where $\psi(n) = d\log\Gamma(n)/dn$. $E(r)$ is evaluated in Table 3 for $n \leq 20$ together with the expected number
of runs of equal maxima, viz.

$$\sum_{d=0}^{n-1}(n-d)/(n+d) = n[2\psi(2n) - 2\psi(n) - 1].$$


Table 3. Expected number of runs of equal maxima and ranges

   n    Maxima   Ranges       n    Maxima   Ranges
   2    1.333    1.000        8    3.606    2.239
   3    1.700    1.167        9    3.991    2.462
   4    2.076    1.367       10    4.375    2.687
   5    2.456    1.579       12    5.146    3.137
   6    2.839    1.796       15    6.303    3.815
   7    3.222    2.017       20    8.232    4.947

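The entries of Table 3 follow directly from the two finite sums, and a short sketch makes them easy to reproduce. The function names are illustrative assumptions; the sums run over $d = 0, \ldots, n-1$, the last term being zero in the case of ranges.

```python
from fractions import Fraction

def expected_run_ranges(n):
    """Sum over d = 0..n-1 of P_{d+1} = (n-d)(n-d-1) / ((n+d)(n+d-1)), for n >= 2."""
    return sum(Fraction((n - d) * (n - d - 1), (n + d) * (n + d - 1)) for d in range(n))

def expected_run_maxima(n):
    """Sum over d = 0..n-1 of (n-d)/(n+d)."""
    return sum(Fraction(n - d, n + d) for d in range(n))

if __name__ == "__main__":
    for n in (2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, 20):
        print(n, round(float(expected_run_maxima(n)), 3),
              round(float(expected_run_ranges(n)), 3))
```

Working in exact fractions and rounding only for display avoids any doubt about the third decimal.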
REFERENCES 


DUNCAN, A. J. (1952). Quality Control and Industrial Statistics. Chicago: Richard D. Irwin.
GODWIN, H. J. (1949). Ann. Math. Statist. 20, 279.
GRANT, E. L. (1946). Statistical Quality Control. New York and London: McGraw-Hill.
GRUBBS, F. E. & WEAVER, L. C. (1947). J. Amer. Statist. Ass. 42, 224.
HOEL, P. G. (1946). Ann. Math. Statist. 17, 475.
HOWELL, J. M. (1949). Ann. Math. Statist. 20, 305.
JONES, H. L. (1948). Ann. Math. Statist. 19, 270.
















Censored samples from truncated normal distributions* 


By A. CLIFFORD COHEN, Jr. 
The University of Georgia 


1. INTRODUCTION AND SUMMARY 


In life-testing and response-time studies, selection procedures sometimes operate to effect a truncation 
at the lower end of the time scale prior to starting a test which is subsequently terminated before all 
sample specimens exhibit the reaction being studied. Resulting samples may thus be regarded as 
censored from a truncated population. The present paper is limited to censored samples from truncated 
normal distributions. It is related to previous studies of truncated and censored samples by Hald 
(1949), the author (1950, 1954), Gupta (1952), and various other writers. It is also related to a paper on 
life testing by Epstein & Sobel (1953), in which some of the advantages of employing censored samples 
to conserve time and test specimens are discussed with regard to a one-parameter exponential distribu- 
tion. For samples of the types considered here, maximum-likelihood estimators (estimates) of the 
population mean and standard deviation are derived, and their asymptotic variances are obtained. 
An illustrative example is given to demonstrate the practical application of these results. 


2. MAXIMUM-LIKELIHOOD ESTIMATION 


The probability density (frequency) function of a normal distribution with mean, $m$, and standard
deviation, $\sigma$, truncated on the left at a fixed terminus, $x_0$, may be written as

$$f(x) = (I_1\sigma\sqrt{2\pi})^{-1}\exp\{-(x-m)^2/(2\sigma^2)\} \quad (x_0 \leq x < \infty), \qquad f(x) = 0 \quad (x < x_0), \qquad (1)$$

where $I_1 = I(\xi_1)$ is the proportion of the complete distribution retained after truncation and $\xi_1$ is the
left terminus (truncation point) in standard units of the complete distribution, i.e.

$$\xi_1 = (x_0 - m)/\sigma, \qquad I_1 = \int_{\xi_1}^{\infty}\phi(t)\,dt, \qquad \phi(t) = (2\pi)^{-1/2}\exp(-\tfrac12 t^2). \qquad (2)$$
E 


Let $N$ sample specimens be selected from a population distributed according to (1) and let their life
spans or reaction times $\{x_i\}$ $(i = 1, 2, 3, \ldots, N)$ be observed and recorded until a fixed number $(n < N)$
have reacted. Let observation be discontinued upon determining $x_n$. Thus $x_n$ is the greatest of the
measured observations, and it is known that each of the $N - n$ censored observations exceeds that value.
When $x$ is distributed according to (1), the likelihood function for a sample of the type described is

$$P = K\,(I_1\sigma\sqrt{2\pi})^{-n}\exp\Big\{-\sum_{1}^{n}(x_i-m)^2/(2\sigma^2)\Big\}\,(I_2/I_1)^{N-n}, \qquad (3)$$

where $K$ is a constant, $\xi_2 = (x_n - m)/\sigma$ and $I_2 = I(\xi_2)$. (4)


According to Gupta (1952), a sample in which the number of censored (unmeasured) observations is 
fixed, is designated as Type II censored as distinguished from Type I censored samples in which the 
terminal is fixed. If rather than stopping a test after $n$ (fixed) observations have been made, it is
terminated at the expiration of a fixed time, $x_T$, then $n$ is a random variable and the resulting sample is of
Type I. The likelihood function of a Type I censored sample differs from (3) only in that $K$ is replaced
by a different constant, and in (4), $x_n$ is replaced by $x_T$. Consequently, the same estimators are applicable
in both cases except for the trivial case of a Type I censored sample in which $n = 0$. In this latter
situation, the estimators would not be defined.
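Taking logarithms of (3), differentiating and equating to zero, we have

$$\frac{\partial L}{\partial m} = -\frac{N}{\sigma}\frac{\phi_1}{I_1} + \frac{1}{\sigma^2}\sum_{1}^{n}(x_i-m) + \frac{N-n}{\sigma}\frac{\phi_2}{I_2} = 0,$$

$$\frac{\partial L}{\partial\sigma} = -\frac{N}{\sigma}\frac{\xi_1\phi_1}{I_1} - \frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{1}^{n}(x_i-m)^2 + \frac{N-n}{\sigma}\frac{\xi_2\phi_2}{I_2} = 0, \qquad (5)$$

where $L = \log P$ and $\phi_i = \phi(\xi_i)$ $(i = 1, 2)$.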


* Sponsored by the Office of Ordnance Research, U.S. Army. 













Let $Z_i$ designate the reciprocal of Mills' ratio, i.e.

$$Z_i = \phi_i/I_i = e^{-\xi_i^2/2}\Big/\int_{\xi_i}^{\infty}e^{-t^2/2}\,dt \qquad (i = 1, 2). \qquad (6)$$

Substitute (6) into (5) and simultaneously substitute $m = x_0 - \sigma\xi_1$, which follows from (2). After
simplifying, we obtain the estimating equations

$$v_1/w = \{(N/n)Z_1 - [(N-n)/n]Z_2 - \xi_1\}/(\xi_2 - \xi_1), \qquad (7a)$$

$$s^2/w^2 = \{1 + (N/n)\xi_1Z_1 - [(N-n)/n]\xi_2Z_2 - [(N/n)Z_1 - ((N-n)/n)Z_2]^2\}/(\xi_2 - \xi_1)^2, \qquad (7b)$$

where $w$ is the restricted sample range, $v_k$ is the $k$th sample moment about the left terminus and $s^2$ is
the sample variance, i.e.

$$w = x_n - x_0, \qquad v_k = \sum_{1}^{n}(x_i - x_0)^k/n, \qquad s^2 = v_2 - v_1^2. \qquad (8)$$

The two equations of (7) can be solved simultaneously for $\hat{\xi}_1$ and $\hat{\xi}_2$ as illustrated in §4. With these
estimates determined, estimates of the mean and standard deviation follow from (2) and (4) as

$$\hat{\sigma} = w/(\hat{\xi}_2 - \hat{\xi}_1) \quad\text{and}\quad \hat{m} = x_0 - \hat{\sigma}\hat{\xi}_1. \qquad (9)$$

The standard maximum-likelihood symbol (^) serves to distinguish estimates from parameters estimated.

Estimating equations (7) can be expressed in a form that is analogous to corresponding equations
previously given by the author (1950) for doubly truncated normal samples. However, $Z_i$ as used here
is not defined quite the same as in the earlier paper. Here $Z_i = Z(\xi_i)$ is the reciprocal of Mills' ratio and
is a function of $\xi_i$ only, whereas for the doubly truncated cases, $Z_1$ and $Z_2$ were each defined as functions
of both $\xi_1$ and $\xi_2$.

When truncation is on the right and censoring is on the left, it is merely necessary to reverse the signs
of all observations and proceed as for the case discussed above since with $f(x)$ truncated on the right,
$f(-x)$ will be truncated on the left.


3. VARIANCES OF ESTIMATES 


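The variance-covariance matrix of $(\hat{m}, \hat{\sigma})$ is derived from the second-order partial derivatives of $L$.
We let $\phi_{11}(\xi_1, \xi_2)$, $\phi_{12}(\xi_1, \xi_2)$ and $\phi_{22}(\xi_1, \xi_2)$ designate stochastic limits of
$-\dfrac{\sigma^2}{n}\dfrac{\partial^2 L}{\partial m^2}$, $-\dfrac{\sigma^2}{n}\dfrac{\partial^2 L}{\partial m\,\partial\sigma}$ and
$-\dfrac{\sigma^2}{n}\dfrac{\partial^2 L}{\partial\sigma^2}$ as $n \to \infty$. Thus we have

$$\phi_{11}(\xi_1, \xi_2) = 1 - (N/n)Z_1(Z_1 - \xi_1) + [(N-n)/n]Z_2(Z_2 - \xi_2),$$
$$\phi_{12}(\xi_1, \xi_2) = (N/n)Z_1[1 - \xi_1(Z_1 - \xi_1)] - [(N-n)/n]Z_2[1 - \xi_2(Z_2 - \xi_2)], \qquad (10)$$
$$\phi_{22}(\xi_1, \xi_2) = 2 + (N/n)\xi_1Z_1[1 - \xi_1(Z_1 - \xi_1)] - [(N-n)/n]\xi_2Z_2[1 - \xi_2(Z_2 - \xi_2)].$$

The asymptotic variances and the covariance are then given as

$$\operatorname{var}(\hat{m}) \sim [\hat{\sigma}^2/n]\,[\phi_{22}/(\phi_{11}\phi_{22} - \phi_{12}^2)],$$
$$\operatorname{var}(\hat{\sigma}) \sim [\hat{\sigma}^2/n]\,[\phi_{11}/(\phi_{11}\phi_{22} - \phi_{12}^2)], \qquad (11)$$
$$\operatorname{cov}(\hat{m}, \hat{\sigma}) \sim [\hat{\sigma}^2/n]\,[-\phi_{12}/(\phi_{11}\phi_{22} - \phi_{12}^2)],$$

where $\phi_{ij}$ is written for $\phi_{ij}(\hat{\xi}_1, \hat{\xi}_2)$.
It subsequently follows that the correlation between $\hat{m}$ and $\hat{\sigma}$ is given by

$$\rho_{\hat{m},\hat{\sigma}} \sim -\phi_{12}/\sqrt{(\phi_{11}\phi_{22})}. \qquad (12)$$

As given above, (10), (11) and (12) are for the Type II censored case in which $N$ and $n$ are fixed while
$x_n$ is a random variable. These same results also apply in the Type I case in which $N$ and $x_T$ are fixed
while $n$ is a random variable, if we replace $n$, $N/n$ and $(N-n)/n$ by their expected values,

$$E(n) = N(I_1 - I_2)/I_1, \qquad E(N/n) = I_1/(I_1 - I_2) \quad\text{and}\quad E[(N-n)/n] = I_2/(I_1 - I_2).$$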


4. AN ILLUSTRATIVE EXAMPLE 


One hundred units of a certain type of electronic device are selected for life testing from a group which 
has survived 600 hr. of prior service. The number of units which expired prior to the time of selection is 
unknown. The total life in hours is recorded as each of the first ninety units of the sample expires, and 










the test is terminated as soon as the ninetieth unit expires. For purposes of this illustration, we consider
logarithms $\{x\}$ of the life spans to be normally distributed so that for the sample selected, we have:

$N = 100$, $n = 90$, $x_0 = \log 600 = 2.778151$, $w = 0.235276$, $v_1 = 0.12560147$,
$s^2 = 0.00348667813$, $v_1/w = 0.533847354$ and $s^2/w^2 = 0.062987824$.

To solve estimating equations (7), we employ an iterative procedure with initial approximations read
from a chart recently given by the writer (1954) for graphically solving estimating equations of doubly
truncated samples. By this procedure, information provided by knowledge of the number of censored
observations is neglected until subsequent iterations. Thereby we obtain $\xi_1^{(0)} = -1.60$ and $\xi_2^{(0)} = 1.21$.
To improve these initial approximations, we substitute $\xi_1^{(0)} = -1.600$ into (7b) and by inverse
interpolation obtain $\xi_2^{(1)} = 1.302$. This value is then substituted into (7a) and we find $\xi_1^{(1)} = -1.663$. Repeating
the cycle, we find $\xi_2^{(2)} = 1.287$ and $\xi_1^{(2)} = -1.626$. In the notation employed here $\xi_i^{(j)}$ is the $j$th
approximation to $\hat{\xi}_i$. Closer approximations to the required estimates could be reached by continuing the iterations
described above through additional cycles. However, greater improvement can be achieved with less
labour if we take advantage of the fact that the iterants already obtained locate two points, $P_1(\xi_1^{(0)}, \xi_2^{(1)})$
and $P_3(\xi_1^{(1)}, \xi_2^{(2)})$, which lie on the curve defined by (7b), and two points, $P_2(\xi_1^{(1)}, \xi_2^{(1)})$ and $P_4(\xi_1^{(2)}, \xi_2^{(2)})$,
which lie on the curve defined by (7a). We approximate the two curves in this vicinity by straight lines
through these two pairs of points, and coordinates of their intersection provide improved
approximations to the required estimates.

Since P, and P, each have the same ordinate, and since P,; and P, likewise have the same ordinate, 
the required coordinates are readily determined by interpolation as summarized below. 








   xi_2     xi_1 by (7a)    xi_1 by (7b)     Diff.
   1.302      -1.663          -1.600        -0.063
   1.293      -1.640          -1.640         0
   1.287      -1.626          -1.663        +0.037




















Thus we have $\hat{\xi}_1 = -1.640$ and $\hat{\xi}_2 = 1.293$, and from (9) it follows that

$$\hat{\sigma}_x = 0.080217, \qquad \hat{m}_x = 2.90971.$$

Since $x = \log T$, where $T$ is the life in hours, we estimate the mean life in hours as $\hat{m}_T = 812.3$ hr.

If additional decimal places had been required in the above results, we could have continued the 
iterative process described above through one or more further cycles. 
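Where a machine is available, equations (7a) and (7b) may also be solved directly. The sketch below is not the inverse-interpolation scheme used above: it substitutes a plain Newton iteration with a forward-difference Jacobian, started from the same chart values. All function names are hypothetical, and the upper-tail area $I(\xi)$ is computed from the complementary error function.

```python
import math

def mills_reciprocal(xi):
    """Z(xi) = phi(xi) / I(xi), with I(xi) the upper-tail normal area."""
    phi = math.exp(-0.5 * xi * xi) / math.sqrt(2.0 * math.pi)
    upper = 0.5 * math.erfc(xi / math.sqrt(2.0))
    return phi / upper

def residuals(xi1, xi2, N, n, v1_over_w, s2_over_w2):
    """Right-hand sides of (7a) and (7b) minus the observed sample ratios."""
    c1, c2 = N / n, (N - n) / n
    z1, z2 = mills_reciprocal(xi1), mills_reciprocal(xi2)
    span = xi2 - xi1
    ra = (c1 * z1 - c2 * z2 - xi1) / span - v1_over_w
    rb = (1 + c1 * xi1 * z1 - c2 * xi2 * z2
          - (c1 * z1 - c2 * z2) ** 2) / span ** 2 - s2_over_w2
    return ra, rb

def solve_xi(N, n, v1_over_w, s2_over_w2, xi1, xi2, steps=50, h=1e-6):
    """Newton iteration in two unknowns (an assumed substitute for hand iteration)."""
    for _ in range(steps):
        fa, fb = residuals(xi1, xi2, N, n, v1_over_w, s2_over_w2)
        a1, b1 = residuals(xi1 + h, xi2, N, n, v1_over_w, s2_over_w2)
        a2, b2 = residuals(xi1, xi2 + h, N, n, v1_over_w, s2_over_w2)
        j11, j12 = (a1 - fa) / h, (a2 - fa) / h
        j21, j22 = (b1 - fb) / h, (b2 - fb) / h
        det = j11 * j22 - j12 * j21
        d1 = (fa * j22 - fb * j12) / det      # Cramer's rule for the Newton step
        d2 = (fb * j11 - fa * j21) / det
        xi1, xi2 = xi1 - d1, xi2 - d2
        if abs(d1) + abs(d2) < 1e-10:
            break
    return xi1, xi2

if __name__ == "__main__":
    xi1, xi2 = solve_xi(100, 90, 0.533847354, 0.062987824, -1.60, 1.21)
    w, x0 = 0.235276, 2.778151
    sigma = w / (xi2 - xi1)                   # equation (9)
    print(xi1, xi2, sigma, x0 - sigma * xi1)
```

The roots found in this way should agree with the interpolated values to about the third decimal, the remaining difference reflecting the linear interpolation used in the hand calculation.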

Variances and covariances of the above estimates may be computed using (10) and (11). Although
ordinary tables of normal curve areas and ordinates are adequate for this purpose, tables provided by
Sampford (1952) greatly facilitate the computations. Sampford tabulated

$$\lambda = Z(Z - \xi) \quad\text{and}\quad \zeta = Z[1 - \xi(Z - \xi)],$$

with $\eta$ designating the argument rather than $\xi$ as here. In using his tables, however, a word of caution
is necessary since an unfortunate printing error resulted in negative signs before some of the entries of
$\zeta$ whereas all of these entries should be positive.

With $\hat{\xi}_1 = -1.640$ and $\hat{\xi}_2 = 1.293$, we interpolate from Sampford's tables, and using (10) compute
$\phi_{11} = 0.87961$, $\phi_{12} = 0.39418$ and $\phi_{22} = 1.12908$. On substituting these values in (11) and (12), we
obtain

$$\operatorname{var}(\hat{m}) \sim 0.0000964, \qquad \operatorname{var}(\hat{\sigma}) \sim 0.0000751 \quad\text{and}\quad \rho_{\hat{m},\hat{\sigma}} \sim -0.3955.$$
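The same quantities can be computed without tables. The following sketch evaluates (10), (11) and (12) at the estimates of the example; the function names are illustrative assumptions, and $I(\xi)$ is again obtained from the complementary error function rather than by interpolation.

```python
import math

def mills_reciprocal(xi):
    """Z(xi) = phi(xi) / I(xi), with I(xi) the upper-tail normal area."""
    phi = math.exp(-0.5 * xi * xi) / math.sqrt(2.0 * math.pi)
    return phi / (0.5 * math.erfc(xi / math.sqrt(2.0)))

def phi_coefficients(xi1, xi2, N, n):
    """phi_11, phi_12, phi_22 of equations (10)."""
    c1, c2 = N / n, (N - n) / n
    z1, z2 = mills_reciprocal(xi1), mills_reciprocal(xi2)
    lam1, lam2 = z1 * (z1 - xi1), z2 * (z2 - xi2)
    zeta1 = z1 * (1 - xi1 * (z1 - xi1))
    zeta2 = z2 * (1 - xi2 * (z2 - xi2))
    p11 = 1 - c1 * lam1 + c2 * lam2
    p12 = c1 * zeta1 - c2 * zeta2
    p22 = 2 + c1 * xi1 * zeta1 - c2 * xi2 * zeta2
    return p11, p12, p22

def asymptotic_moments(sigma, n, p11, p12, p22):
    """var(m), var(sigma) from (11) and the correlation from (12)."""
    det = p11 * p22 - p12 ** 2
    scale = sigma ** 2 / n
    return scale * p22 / det, scale * p11 / det, -p12 / math.sqrt(p11 * p22)

if __name__ == "__main__":
    p11, p12, p22 = phi_coefficients(-1.640, 1.293, 100, 90)
    print(p11, p12, p22)
    print(asymptotic_moments(0.080217, 90, p11, p12, p22))
```

Small differences in the fifth decimal from the printed values are to be expected, since the printed figures rest on tabular interpolation.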


REFERENCES 


COHEN, A. C., Jr. (1950). Estimating the mean and variance of normal populations from singly
truncated and doubly truncated samples. Ann. Math. Statist. 21, 557.

COHEN, A. C., Jr. (1954). Truncated and censored samples. Tech. Rep. no. 11, Ordnance research
contract DA-01-009-ORD-288. Univ. Ga. Dep. Math.

EPSTEIN, B. & SOBEL, M. (1953). Life testing. J. Amer. Statist. Ass. 48, 486.










GUPTA, A. K. (1952). Estimation of the mean and standard deviation of a normal population from
a censored sample. Biometrika, 39, 260.

HALD, A. (1949). Maximum likelihood estimation of the parameters of a normal distribution which is
truncated at a known point. Skand. AktuarTidskr. 32, 119.

SAMPFORD, M. R. (1952). The estimation of response-time distributions. II. Biometrics, 8, 307.


The rapid calculation of $\chi^2$ as a test of homogeneity from a 2 × n table


By J. B. S. HALDANE
Department of Biometry, University College, London


In genetical work we very frequently have to test a (2 x n)-fold table for homogeneity. We count, for 
example, non-recombinants and recombinants in an experiment on linkage. The frequency of the latter 
is unknown or only roughly known before the experiment. Only if the values found in several families 
are homogeneous have we the right to assume that the value derived from the total has the precision 
given by its standard sampling error. Suppose our table to be 











$$\begin{array}{cccccc|c}
a_1 & a_2 & \ldots & a_r & \ldots & a_n & A \\
b_1 & b_2 & \ldots & b_r & \ldots & b_n & B \\
\hline
s_1 & s_2 & \ldots & s_r & \ldots & s_n & N
\end{array}$$

That is to say the $r$th sample of $s_r$ members consists of $a_r$ members of one class and $b_r$ of the other, while
$\Sigma a_r = A$, $\Sigma b_r = B$, $A + B = N$. The exact contribution of the $r$th sample to $\chi^2$ as a test of homogeneity
is

$$\frac{(a_rB - b_rA)^2}{ABs_r},$$

or one of many equivalent expressions.

Suppose $A > B$. If $k$ is the nearest integer to $AB^{-1}$, $(A - kB)/N$ may be small, say $< 0.1$. If not we can
take a convergent $k/h$ to $A/B$, where $h$ and $k$ are small integers, and $(hA - kB)/N$ will be small.

Then the following is an exact expression for the 'homogeneity $\chi^2$' with $n - 1$ degrees of freedom,
where $h$ may be unity:


$$\chi^2 = \frac{N^2}{(h+k)^2AB}\Big[\sum_{r=1}^{n}(ha_r - kb_r)^2s_r^{-1} - (hA - kB)^2N^{-1}\Big], \qquad (1)$$


whilst the following is in error by a factor of less than [2 - 


$$\chi^2 \simeq \frac{1}{hk}\Big[\sum_{r=1}^{n}(ha_r - kb_r)^2s_r^{-1} - (hA - kB)^2N^{-1}\Big]. \qquad (2)$$


If $A$ and $B$ are of the order of 1000, $(Ba_r - Ab_r)^2$ will be an integer with about five more figures than
$(ha_r - kb_r)^2$, and the saving of labour is very considerable, while for smaller values most of the calculations
can be reduced to mental arithmetic. The proof is as follows.

Let $\chi_r^2 = (a_rB - b_rA)^2/(ABs_r)$, so $\chi^2 = \sum_{r=1}^{n}\chi_r^2$.
Let $x_r = a_r - As_rN^{-1} = Bs_rN^{-1} - b_r$.
Then $\Sigma x_r = 0$ and $\chi_r^2 = N^2A^{-1}B^{-1}x_r^2s_r^{-1}$.
Also $ha_r - kb_r = (hA - kB)s_rN^{-1} + (h+k)x_r$.
So

$$(ha_r - kb_r)^2s_r^{-1} = (hA - kB)^2N^{-2}s_r + 2(h+k)(hA - kB)N^{-1}x_r + (h+k)^2x_r^2s_r^{-1}$$
$$= (hA - kB)^2N^{-2}s_r + 2(h+k)(hA - kB)N^{-1}x_r + (h+k)^2ABN^{-2}\chi_r^2.$$

Hence

$$\sum_{r=1}^{n}[(ha_r - kb_r)^2s_r^{-1}] = (hA - kB)^2N^{-1} + (h+k)^2ABN^{-2}\chi^2,$$




whence (1) follows. But if $hA - kB = D$, then

$$A = \frac{kN + D}{h+k}, \qquad B = \frac{hN - D}{h+k}, \qquad \frac{N^2}{(h+k)^2AB} = \frac{1}{hk}\Big[1 + \frac{(h-k)D}{hkN} - \frac{D^2}{hkN^2}\Big]^{-1},$$

whence (2) follows.

As an example I take Table 1, where a, and 6, are the numbers of normal and vestigial Drosophila 
melanogaster counted in eleven bottles in a class experiment where a 3:1 ratio was expected on Men- 
delian theory, but vestigial was known to be somewhat inviable. The question arose whether the mor- 
tality of recessives had been significantly different in different bottles. Clearly h = 1, k = 6, gives a 
good fit. 

















Table 1

   a_r    b_r    s_r    a_r - 6b_r    (a_r - 6b_r)^2/s_r
    25      1     26       +19             13.885
    80     15     95       -10              1.053
    38     12     50       -34             23.120
    52      8     60       + 4              0.267
     9      0      9       + 9              9.000
    21      7     28       -21             15.750
    33      6     39       - 3              0.231
    24      2     26       +12              5.538
    30      7     37       -12              3.892
    51      7     58       + 9              1.397
    56      3     59       +38             24.475
   419     68    487       +11             98.608























Here $h = 1$, $k = 6$, $A = 419$, $B = 68$, $N = 487$, $\sum_{r=1}^{n}(a_r - 6b_r)^2s_r^{-1} - (A - 6B)^2N^{-1} = 98.360$. So formula
(2) gives $\chi^2 = 16.393$, formula (1) $\chi^2 = 16.709$. There are 10 degrees of freedom, so $0.09 > P > 0.08$, and
the data are not significantly heterogeneous. Had the expression $\Sigma[(Ba_r - Ab_r)^2s_r^{-1}]/(AB)$ been used,
the first summand would have been $1281^2/26$. In practice only two decimal places were used in the
sum, and formula (2) was used, the whole calculation being completed on the blackboard in about
ten minutes.
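The arithmetic of this example is easily repeated by machine. The sketch below computes the direct expression, formula (1) and formula (2) for the bottle counts of Table 1; the function name and its return convention are illustrative assumptions.

```python
def homogeneity_chi2(a, b, h=1, k=1):
    """Return (direct, formula (1), formula (2)) for a 2 x n table of counts a_r, b_r."""
    A, B = sum(a), sum(b)
    N = A + B
    s = [ar + br for ar, br in zip(a, b)]
    # Direct expression: sum of (a_r B - b_r A)^2 / (A B s_r).
    direct = sum((ar * B - br * A) ** 2 / (A * B * sr) for ar, br, sr in zip(a, b, s))
    # Bracket common to formulas (1) and (2), built from the small integers h a_r - k b_r.
    core = sum((h * ar - k * br) ** 2 / sr for ar, br, sr in zip(a, b, s)) \
        - (h * A - k * B) ** 2 / N
    exact = N ** 2 / ((h + k) ** 2 * A * B) * core
    approx = core / (h * k)
    return direct, exact, approx

if __name__ == "__main__":
    normal = [25, 80, 38, 52, 9, 21, 33, 24, 30, 51, 56]
    vestigial = [1, 15, 12, 8, 0, 7, 6, 2, 7, 7, 3]
    print(homogeneity_chi2(normal, vestigial, h=1, k=6))
```

Formula (1) reproduces the direct value exactly, whatever $h$ and $k$ are chosen; only formula (2) depends on $hA - kB$ being small relative to $N$.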

It is possible to justify formula (2) verbally as follows. $\Sigma[(ha_r - kb_r)^2s_r^{-1}]$ is the value of $\chi^2$ for $n$
degrees of freedom, if we assume $E(a_r) = ks_r/(h+k)$; while $(hA - kB)^2N^{-1}$ is the value of $\chi^2$ for $A$ and $B$
on the same assumption. Subtracting this we obtain $\chi^2$ with $n - 1$ degrees of freedom. In fact, the
dissection of a $\chi^2$ into components is only valid when all samples are large, and in this case leads to
an error of 2%. While this dissection is often justifiable, it is desirable to know the magnitude of its error.


The 'Inefficiency' of the sample median for many familiar symmetric distributions*
By JOHN T. CHU, University of North Carolina


1. A LOWER BOUND FOR THE VARIANCE OF THE MEDIAN 


If the reciprocal of the (asymptotic) variance of an estimate is taken as a measure of its (asymptotic)
efficiency, the sample median $\tilde{x}$ is often (asymptotically) less efficient than the sample mean $\bar{x}$, for many
symmetric distributions familiar to statisticians. In fact, for a symmetric distribution having its
absolute maximum at the point of symmetry, if $\tilde{x}$ is asymptotically less efficient than $\bar{x}$, then quite often
$\tilde{x}$ is never so efficient as $\bar{x}$, with the possible exception of very small samples. To show these facts, we first
derive a very simple, yet sharp, lower bound for the variance of the sample median.


* Sponsored by the Office of Naval Research. 










Suppose that $F(x)$ and $f(x)$ are the cumulative distribution function and the probability density
function of a certain continuous distribution, and $f(x)$ is symmetric with respect to $x = \xi$, and $f(\xi) \geq f(x)$
for all $x$. Let $\tilde{x}$ be the median of a sample of size $2n+1$; then, writing $C_n = (2n+1)!/(n!\,n!)$,

$$\sigma^2_{\tilde{x}} = \int_{-\infty}^{\infty}(x-\xi)^2C_n[F(x)]^n[1-F(x)]^nf(x)\,dx$$
$$= \int_0^1(x-\xi)^2C_nF^n(1-F)^n\,dF$$
$$\geq [f(\xi)]^{-2}\int_0^1(F-\tfrac12)^2C_nF^n(1-F)^n\,dF$$
$$= \{4[f(\xi)]^2(2n+3)\}^{-1}. \qquad (1)$$

The equality holds for a rectangular distribution.


2. EXAMPLES 


It is well known that for normal and rectangular distributions $\bar{x}$ is more efficient than $\tilde{x}$. We shall show
that this is true for many other familiar symmetric distributions.

(2.1) Triangular distribution. $f(x) = 1 - |x|$, $|x| \leq 1$. The variance of $\bar{x}$ is $\tfrac16(2n+1)^{-1}$ and from (1)
it follows that $\tilde{x}$ is less efficient than $\bar{x}$ for samples of sizes $2n+1$ provided $n > 1$. Direct computation
shows that this is also true if $n = 1$.

(2.2) t-Distribution. $f(t) = A(1 + t^2/k)^{-\frac12(k+1)}$, where $A^{-1} = \Gamma(\tfrac12k)(k\pi)^{1/2}/\Gamma[\tfrac12(k+1)]$. Since $\sigma_t^2 = k/(k-2)$,
it follows that for a t-distribution with $k$ degrees of freedom, $\tilde{x}$ is less efficient than $\bar{x}$ if (but not
necessarily only if)

$$\frac{\pi(k-2)\,\Gamma^2(\tfrac12k)}{4\,\Gamma^2(\tfrac12k+\tfrac12)} > \frac{2n+3}{2n+1}.$$

For both $k = 2m$ and $k = 2m+1$, the left-hand side of the above inequality is an increasing function of $m$.
Computation shows, e.g. that the inequality holds if $k \geq 5$ and $n \geq 25$.
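The condition for the t-distribution is simple to examine numerically. The following sketch evaluates the two sides of the inequality with the standard-library gamma function; the function names are illustrative assumptions.

```python
import math

def t_lhs(k):
    """pi (k - 2) Gamma(k/2)^2 / (4 Gamma((k + 1)/2)^2) for k degrees of freedom."""
    return math.pi * (k - 2) * math.gamma(k / 2) ** 2 / (4 * math.gamma((k + 1) / 2) ** 2)

def rhs(n):
    """(2n + 3)/(2n + 1) for a sample of size 2n + 1."""
    return (2 * n + 3) / (2 * n + 1)

def median_less_efficient_t(k, n):
    """True when the sufficient condition of (2.2) is met."""
    return t_lhs(k) > rhs(n)

if __name__ == "__main__":
    for k in (3, 4, 5, 6, 10):
        print(k, t_lhs(k), median_less_efficient_t(k, 25))
```

Since the right-hand side always exceeds unity, the condition can never be met when the left-hand side is below 1, which is the situation for $k = 4$.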
(2.3) Symmetric $\beta$-distribution.

$$f(x) = \frac{\Gamma(2p)}{\Gamma^2(p)}\,x^{p-1}(1-x)^{p-1} \qquad (0 < x < 1;\ p > 0).$$

For this distribution $\sigma_x^2 = \tfrac14(2p+1)^{-1}$. Hence $\tilde{x}$ is less efficient than $\bar{x}$ if

$$2^{4p-4}(2p+1)\,\Gamma^4(p)/\Gamma^2(2p) > \frac{2n+3}{2n+1}.$$

The left-hand side becomes smaller if $p$ is replaced by $p+1$, and tends to $\tfrac12\pi$ as $p \to \infty$. So it has a lower
bound $\tfrac12\pi$. Hence the inequality holds for every $p > 0$ and $n \geq 2$.

(2.4) Cauchy type distribution. This is defined to be of the type $f(x) = C_\alpha/(1 + |x|^\alpha)$ $(-\infty < x < \infty;\ \alpha > 1)$.
If $\alpha = 2$, we obtain the well-known Cauchy distribution for which $\tilde{x}$ is infinitely more efficient than $\bar{x}$.
It would be interesting to examine whether or not $\bar{x}$ becomes more efficient as $\alpha$ increases. Now $\bar{x}$ has
a finite variance only if $\alpha > 3$. $C_\alpha$ and var $\bar{x}$ can be obtained by using contour integration (see Whittaker
& Watson, 1952, p. 118). It follows that $\tilde{x}$ is less efficient than $\bar{x}$ if

$$\frac{\gamma^2\sin 3\gamma}{\sin^3\gamma} > \frac{2n+3}{2n+1},$$

where $\gamma = \pi/\alpha$. The left-hand side is a decreasing function of $\gamma$, and so an increasing function of $\alpha$. The
least values of $\alpha$ for which the left-hand side is equal to 5/3 and 1 (the maximum and minimum of the
right-hand side), are found to be 4.65 and 3.75 approximately.
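The two critical values of $\alpha$ can be located by bisection, since the left-hand side is stated to be increasing in $\alpha$. The sketch below relies on that monotonicity; the function names and the search interval are assumptions made for illustration.

```python
import math

def cauchy_lhs(alpha):
    """gamma^2 sin(3 gamma) / sin^3(gamma), with gamma = pi / alpha."""
    g = math.pi / alpha
    return g * g * math.sin(3 * g) / math.sin(g) ** 3

def least_alpha(target, lo=3.01, hi=10.0, tol=1e-9):
    """Bisection for cauchy_lhs(alpha) = target, assuming the function increases on (3, inf)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cauchy_lhs(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    print(least_alpha(1.0), least_alpha(5.0 / 3.0))
```

The roots should lie close to the approximate values quoted in the text; agreement beyond the second significant figure is not to be expected from figures given as approximate.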


3. REMARKS 


(i) $\bar{x}$ is not more efficient than $\tilde{x}$ for all symmetric distributions. When the parent population has a
Laplace distribution, for example, $\tilde{x}$ is more efficient for all samples of odd sizes (Chu & Hotelling).

(ii) If $f(x)$ satisfies certain continuity conditions, $\tilde{x}$ has an asymptotically normal distribution and the
asymptotic variance is $\{4[f(\xi)]^2(2n+1)\}^{-1}$. If the sample size, therefore, is not too small, the asymptotic
variance is for all practical purposes a lower bound for the variance of $\tilde{x}$. And if $\tilde{x}$ is asymptotically less
efficient than $\bar{x}$, then $\tilde{x}$ is less efficient than $\bar{x}$ for all samples whose sizes are not too small.


REFERENCES 


CHU, J. T. & HOTELLING, H. The moments of the sample median (to be published).
WHITTAKER, E. T. & WATSON, G. N. (1952). Modern Analysis. Cambridge University Press.




A simple method of calculating the exact probability in 2 x 2 contingency 
tables with small marginal totals 


By P. H. LESLIE 
Bureau of Animal Population, Department of Zoological Field Studies, Oxford 


It frequently happens that in the case of data which may be arranged in the form of a 2 x 2 contingency 
table, we wish to determine the exact probability of obtaining the observed result. In the method which 
appears to be adopted usually, we have to calculate the value of a number of expressions involving 
factorials. Although the actual calculations are not difficult, they become somewhat tedious even with 
the help of a table of log factorials, and a simple method of obtaining the exact solution may therefore 
be of practical use. The following method is convenient when the marginal totals are small (and usually 
it is only in such cases that we need to calculate the exact solution), and only a table of the binomial 
coefficients up to, say, n = 20 is required. Tables of these coefficients seem to be published up to n = 12, 
e.g. Barlow, and Milne-Thomson & Comrie, four-figure tables; but they may be extended once and for 
all to higher values very simply by means of Pascal’s triangle. 
Suppose we write any 2 × 2 table in the standard form:

$$\begin{array}{cc|c}
x & \cdot & n_B \\
\cdot & \cdot & N - n_B \\
\hline
n_A & N - n_A & N
\end{array}$$

where $N$ is the total number of observations, and the marginal totals fulfil the following inequalities,

$$n_A \leq N - n_A, \qquad n_B \leq N - n_B \quad\text{and}\quad n_B \leq n_A;$$

so that $n_B$ is the smallest marginal total, and $x$ the number of observations in the cell with the smallest
expectation. Then the set of possible results which is compatible with the given marginal totals is
obtained by giving $x$ in turn the successive values $0, 1, 2, \ldots, n_B$, and the probability of obtaining a
particular value of $x$ is given by the appropriate term of the hypergeometric distribution,

$$P(x) = \binom{n_A}{x}\binom{N - n_A}{n_B - x}\Big/\binom{N}{n_B}.$$

If, as a first step, we write down opposite the values of $x = 0, 1, 2, \ldots, n_B$, starting at $x = 0$, the
successive binomial coefficients for $n = n_A$, which we will call column $a_x$; and if, secondly, starting at the
bottom opposite $x = n_B$, we write down in reverse order the binomial coefficients for $n = N - n_A$, and
call this column $b_x$; then the product column $a_xb_x$ gives the successive terms of the distribution we require.
The total of this last column is

$$\Sigma a_xb_x = \frac{N!}{n_B!\,(N - n_B)!},$$

which serves as a check if it is needed. The calculation of the exact probability $\sum_0^x P(x)$ or $\sum_x^{n_B} P(x)$
(whichever tail of the distribution is appropriate in the particular case), then follows very quickly on an
ordinary calculating machine; in fact, the individual product terms, $a_xb_x$, need not necessarily be
written down.

Merely as an arithmetical example, suppose we had the following table, where $n_A = 9$, $n_B = 8$, $N = 20$
and $x = 7$,

$$\begin{array}{cc|c}
7 & 1 & 8 \\
2 & 10 & 12 \\
\hline
9 & 11 & 20
\end{array}$$


Then the two essential columns are the binomial coefficients for $n = 9$, forming column $a_x$, and those
for $n = 11$, forming column $b_x$. Thus

   x     a_x     b_x
   0       1     165
   1       9     330
   2      36     462
   3      84     462
   4     126     330
   5     126     165
   6      84      55
   7      36      11
   8       9       1

Then, the sum of the product column, $\Sigma a_xb_x = 125{,}970$, which can be obtained directly on the machine
without writing down the individual terms. Since the observed number, $x = 7$, in the cell with the
smallest expectation is greater than expectation, we would require in this case the right-hand tail of the
distribution. Thus

$$(36 \times 11) + (9 \times 1) = 405,$$

and

$$\sum_{7}^{8} P(x) = 405/125970 = 0.003215.$$
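The two-column scheme translates directly into a short routine. The sketch below builds columns $a_x$ and $b_x$ from binomial coefficients, checks the total of the product column, and sums the required tail; the function name and the `upper` flag are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def exact_tail(n_a, n_b, N, x, upper=True):
    """Exact tail probability for a 2 x 2 table in the standard form of the text."""
    a_col = [comb(n_a, t) for t in range(n_b + 1)]             # column a_x
    b_col = [comb(N - n_a, n_b - t) for t in range(n_b + 1)]   # column b_x, reversed order
    products = [a * b for a, b in zip(a_col, b_col)]
    total = sum(products)
    assert total == comb(N, n_b)      # the check on the total of the product column
    chosen = products[x:] if upper else products[:x + 1]
    return Fraction(sum(chosen), total)

if __name__ == "__main__":
    p = exact_tail(9, 8, 20, 7)
    print(p, float(p))
```

Returning an exact fraction keeps the numerator and denominator of the text visible; conversion to a decimal is left to the caller.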
7 


A test for a change in a parameter occurring at an unknown point 


By E. S. PAGE
Department of Mathematics, University of Durham 


1. INTRODUCTION AND SUMMARY 


Consider a sample of independent observations in the order in which they were obtained, $x_1, \ldots, x_n$; it is
sometimes required to test the null hypothesis that all the observations are drawn from the same
population with distribution function $F(x\,|\,\theta)$ against the alternative that $x_1, \ldots, x_m$ come from $F(x\,|\,\theta)$,
and $x_{m+1}, \ldots, x_n$ from $F(x\,|\,\theta')$ $(\theta' \neq \theta)$. If $m$ is known this is a straightforward problem of comparing two
samples. In this paper we suppose that $m$ is unknown; this raises new problems. A test is proposed for
a case where $\theta$ is known and some comments are made on the problems presented by other cases.


2. ONE-SIDED CASE: 0 KNOWN 


Suppose that the initial value, $\theta$, is known. One possibility for a test is to regard all the observations as
a single sample and to use the best test that all the observations are from $F(x \mid \theta)$ against the alternative
that all are from $F(x \mid \theta')$ for some $\theta' \ne \theta$. Such a test cannot be expected to be very powerful if the change
occurs late in the sample; the few observations on the new parameter value would be obscured by the
many on the old parameter. Since the problem of detecting a change in a parameter is important in
controlling the quality of the output from a continuous production process, it is reasonable to
investigate whether the methods of process inspection schemes can provide useful tests for the case in
which we are interested.

A process inspection scheme for detecting a change in one direction in the parameter was given by the
author in an earlier paper (Page, 1954). We suppose throughout that the parameter under consideration
is the mean of the distribution unless the contrary is stated, and in this section we further assume that
the value, $\theta$, at the start of the observations is known. The scheme consisted of recording the cumulative
sums $S_r = \sum_{i=1}^{r} (x_i - \theta)$, $S_0 = 0$, and taking action to rectify a possible change in the parameter when
$S_r - \min_{0 \le i < r} S_i \ge h$, i.e. when the sample path rises a height $h$ above its previous minimum value; this
procedure can be displayed clearly on a chart. If there is no change in $\theta$, the mean path of the cumulative
sum is horizontal, while if an increase in $\theta$ occurs the new mean path has positive slope so that the above
criterion would be satisfied without too much delay. The significance test suggested by this procedure
is as follows:


I. Given the observations $x_1, \ldots, x_n$. It is required to test the hypothesis that the mean is constantly $\theta$.
Use as the test statistic $m = \max_{0 \le r \le n} \{S_r - \min_{0 \le i \le r} S_i\}$, where $S_r = \sum_{i=1}^{r} (x_i - \theta)$, $S_0 = 0$, taking large values as
significant, i.e. reject the hypothesis if $m \ge h$.

It was shown in the paper cited that the properties of the corresponding process inspection scheme 
depended upon the characteristics of linear sequential tests; as the test I is a truncated form of the 
process inspection scheme it is to be expected that in general the properties of the test will be difficult 
to evaluate. A special case that is tractable is where the observations are nought-or-one binomial 
variables. Accordingly, we consider a test for the general case using binomial variables. 

II. Given the observations $x_1, \ldots, x_n$. It is required to test the hypothesis that the mean is constantly $\theta$.
Let $y_i = a$ if $x_i - \theta \ge 0$ and $y_i = -b$ if $x_i - \theta < 0$, and choose $a$, $b$ $(> 0)$ so that $E(y_i \mid \theta) = 0$ $(i = 1, \ldots, n)$.
Use as the test statistic $m = \max_{0 \le r \le n} \{S_r - \min_{0 \le i \le r} S_i\}$, where $S_r = \sum_{i=1}^{r} y_i$, $S_0 = 0$, taking large values as
significant, i.e. reject the hypothesis if $m \ge h$.

For simplicity we shall consider only the case where the distribution of the $x_i$ is symmetrical, so that
we can take $a = b = 1$; hence $y_i = \operatorname{sgn}(x_i - \theta)$. In order to evaluate the properties of the test let
$m_r = S_r - \min_{0 \le i \le r} S_i$, and let $p_{r,i}$ be the probability that $m_r = i$ $(i = 0, 1, \ldots, h-1)$ and that $m_s < h$ for all $s$,
$1 \le s < r$. Then
$$1 - \sum_{i=0}^{h-1} p_{r,i} \tag{1}$$
is the probability that the null hypothesis is rejected. Let $\operatorname{prob}(y_i = 1)$ be $p = 1 - q$. By considering the
result of the next observation we have the relations
$$\begin{aligned}
p_{r+1,0} &= q\,(p_{r,0} + p_{r,1}), \\
p_{r+1,i} &= p\,p_{r,i-1} + q\,p_{r,i+1} \quad (1 \le i < h-1), \\
p_{r+1,h-1} &= p\,p_{r,h-2}.
\end{aligned} \tag{2}$$


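In matrix notation we have
$$\mathbf{p}_{r+1} = P\,\mathbf{p}_r, \tag{3}$$
where $P$ is the square matrix,
$$P = \begin{pmatrix}
q & q & 0 & 0 & \cdots & 0 & 0 \\
p & 0 & q & 0 & \cdots & 0 & 0 \\
0 & p & 0 & q & \cdots & 0 & 0 \\
\vdots & & & & & & \vdots \\
0 & 0 & 0 & 0 & \cdots & p & 0
\end{pmatrix}. \tag{4}$$
Initially $p_{0,0} = 1$, $p_{0,i} = 0$ $(i \ne 0)$. In this formulation we are implicitly using the fact that the $m_r$ are
variables in a Markov chain with $h$ live states. Clearly
$$\mathbf{p}_r = P^r\,\mathbf{p}_0. \tag{5}$$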


The expression for $\mathbf{p}_r$ may alternatively be written in terms of the latent roots and vectors, but the
simplicity of the matrix $P$ makes it quite convenient to use (4) for calculation. On the null hypothesis
$p = \tfrac{1}{2}$ constantly. Table 1 shows values of $h$ and the sample size $n$ for which the probabilities of errors
of the first kind are at most $\alpha$, where $\alpha = 0.05$ and $0.01$. In order to ensure that the Type I errors are at
most $\alpha$ for non-tabular values of $n$ the larger value of $h$ should be taken. For larger values of $n$ rough
interpolation will provide a sufficiently accurate value of $h$.

The power of the test II depends both on the value of $p$ after the change and the position of the change.
If $\operatorname{prob}(y_i = +1)$ is constantly equal to $p$ so that the change can be considered as having occurred
immediately, the probability that the null hypothesis is rejected in a sample of $n$ observations is given
by equations (1) and (5) with $r = n$. If the change occurs after the $k$th observation the value of $\mathbf{p}_n$ to be
used in (1) is given by
$$\mathbf{p}_n = P_1^{\,n-k}\,P_0^{\,k}\,\mathbf{p}_0,$$
where $P_0$, $P_1$ are the matrix $P$ with $p = \tfrac{1}{2}$, $p = p_1$ respectively.
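
The recurrence (2) lends itself to direct computation. The following is a minimal sketch, not the author's EDSAC program: it assumes that the hypothesis is rejected as soon as $m_r$ reaches $h$, and the function names `step` and `rejection_probability` are illustrative only.

```python
def step(state, p):
    """Apply the matrix P of (4) once to the vector of live-state probabilities."""
    h = len(state)
    q = 1.0 - p
    new = [0.0] * h
    for i, mass in enumerate(state):
        # y = -1: m falls by one but cannot go below zero.
        new[max(i - 1, 0)] += q * mass
        # y = +1: m rises by one; if it would reach h the path is absorbed
        # (the hypothesis is rejected) and the mass leaves the live states.
        if i + 1 < h:
            new[i + 1] += p * mass
    return new


def rejection_probability(n, h, p_after, k=0, p_before=0.5):
    """Expression (1) for n observations, with the change after the k-th one."""
    state = [1.0] + [0.0] * (h - 1)          # p_0: m_0 = 0 with certainty
    for r in range(n):
        state = step(state, p_before if r < k else p_after)
    return 1.0 - sum(state)


# Size and power for n = 50, h = 16 (the case discussed in the text).
size = rejection_probability(50, 16, p_after=0.5)
power_immediate = rejection_probability(50, 16, p_after=0.75)
power_late = rejection_probability(50, 16, p_after=0.75, k=30)
print(size, power_immediate, power_late)
```

Because $P$ is so sparse, each step costs only about $2h$ multiplications, so the full vector can be carried forward observation by observation without forming $P^r$ explicitly.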










Table 1. Values of n and h

            α = 0.05                 α = 0.01
      n     h       n     h         n     h

      ?     ?       ?     ?        20    12
     26    11      83    20        27    14
      ?    12       ?     ?        35    16
      ?     ?     100    22        43    18
      ?    14     119    24        53    20
     47    15       ?    26        64    22
     54    16     161    28        76    24
     60    17     185    30        89    26
      ?    18                     103    28
      ?     ?

(? marks an entry that is illegible in the source.)





In Table 2 the power of the test II is compared with the power of the simple binomial test with
approximately the same probability of Type I errors for a sample of 50 observations. For test II to have
probability of Type I errors just less than 0.05 we need $h = 16$. The corresponding single-sample test is
'Reject the null hypothesis if more than 31 of the $y_i$ are positive', i.e. if $\sum_{i=1}^{50} y_i > 12$. The loss of power
from using test II instead of the single-sample test is remarkably small.
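
The single-sample comparison test is an ordinary binomial tail, so its size and power can be sketched in a few lines. The helper name `binomial_upper_tail` is an assumption made for illustration; only the rule "more than 31 positives out of 50" comes from the text.

```python
from math import comb


def binomial_upper_tail(n, c, p):
    """P(X > c) for X distributed as Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(c + 1, n + 1))


# 'Reject if more than 31 of the 50 y_i are positive.'
size = binomial_upper_tail(50, 31, 0.5)
powers = {p: binomial_upper_tail(50, 31, p) for p in (0.55, 0.60, 0.65, 0.70, 0.75, 0.80)}

print(round(size, 3))
for p, value in powers.items():
    print(p, round(value, 3))
```

The printed values should be comparable with the single-sample column of Table 2.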


Table 2. Powers of the tests for different p

      p       Test II     Single-sample test

    0.50       0.039            0.032
    0.55       0.136            0.127
    0.60       0.336            0.336
    0.65       0.609            0.622
    0.70       0.844            0.859
    0.75       0.964            0.971
    0.80       0.996            0.997





The same two tests are contrasted in Table 3, where their powers are shown for different positions of
the change from $p = 0.5$ to $p = 0.75$. Here, however, the test II has an appreciably greater power than
the single-sample test when the change occurs near the middle of the set of observations. Also shown in
Table 3 are the powers of the single-sample test on the last $50 - m$ observations, when it is known that
the change has occurred immediately before the $m$th observation. The difference between the power of
this test and that of test II gives an indication of what is lost from the ignorance of the position of change.

In order to illustrate the test we give an example constructed from tables of random normal deviates.
A sample of forty observations was constructed, the first twenty having mean 5 and unit variance, and
the last twenty having mean 6 and unit variance; these are shown in Table 4. Suppose that it is required
to test the hypothesis that the mean is constantly 5 against the alternatives that an increase in the mean
has occurred within the sample. The observations, $x_i$, are shown in Table 4 together with $y_i = \operatorname{sgn}(x_i - 5)$,
and the value taken by $S_r - \min S_i$.

The greatest value, $h$, of $S_r - \min S_i$ in the sample of 40 is 17, which approaches the 1% significance
point given in Table 1 (for $n = 40$, the approximate 5% point is $h = 14$, the approximate 1% point is
$h = 18$). This significance level can be compared with that obtained from other tests applied to the













Table 3. Powers of the tests for different positions of the change

                           Single-sample       Single-sample
      m       Test II         test on          test, m known
                           whole sample

      0        0.964           0.971               0.971
     10        0.906           0.864               0.946
     20        0.733           0.625               0.894
     30        0.398           0.330               0.618
     40        0.122           0.122               0.244
     50        0.039           0.032               0.032



















Table 4. Artificial sampling experiment

Observation no.          1     2     3     4     5     6     7     8     9    10
Value of x_i          3.95  5.96  6.22  5.58  4.02  4.97  3.46  4.29  4.65  5.66
y_i = sgn(x_i - 5)      -1    +1    +1    +1    -1    -1    -1    -1    -1    +1
S_r - min S_i            0     1     2     3     2     1     0     0     0     1

Observation no.         11    12    13    14    15    16    17    18    19    20
Value of x_i          5.44  5.91  4.98  3.58  5.26  3.98  4.19  6.66  6.05  5.97
y_i = sgn(x_i - 5)      +1    +1    -1    -1    +1    -1    -1    +1    +1    +1
S_r - min S_i            2     3     2     1     2     1     0     1     2     3

Observation no.         21    22    23    24    25    26    27    28    29    30
Value of x_i          7.14  6.22  4.76  6.60  5.72  4.88  5.44  5.03  5.66  5.56
y_i = sgn(x_i - 5)      +1    +1    -1    +1    +1    -1    +1    +1    +1    +1
S_r - min S_i            4     5     4     5     6     5     6     7     8     9

Observation no.         31    32    33    34    35    36    37    38    39    40
Value of x_i          6.37  6.66  5.10  5.80  6.29  5.49  4.93  6.18  8.29  6.84
y_i = sgn(x_i - 5)      +1    +1    +1    +1    +1    +1    -1    +1    +1    +1
S_r - min S_i           10    11    12    13    14    15    14    15    16    17














sample. The single-sample binomial test on the $y$'s has 26 positives, 14 negatives; on the null hypothesis
the probability of this or a larger number of positives is 0.04. The change in the mean causes the estimate
of variance of the $x$'s to be inflated, and a $t$-test fails to give significance. The computation required by
test II is so simple that it is unnecessary to record the $y$'s, or even the $S_r - \min S_i$. An additional
advantage of the test is that it gives an indication where the change took place; the position of the last zero
of $S_r - \min S_i$ is an estimate (of course, biased) of the position of change. Thus in the example we would
suspect that the change had occurred near observation 17.
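
The working of the example can be retraced mechanically. The sketch below applies test II with $a = b = 1$ to the forty values of Table 4; the function name `cusum_sign_test` is illustrative, and the convention $y_i = +1$ when $x_i - \theta \ge 0$ follows the statement of test II.

```python
# Values of x_i transcribed from Table 4.
x = [3.95, 5.96, 6.22, 5.58, 4.02, 4.97, 3.46, 4.29, 4.65, 5.66,
     5.44, 5.91, 4.98, 3.58, 5.26, 3.98, 4.19, 6.66, 6.05, 5.97,
     7.14, 6.22, 4.76, 6.60, 5.72, 4.88, 5.44, 5.03, 5.66, 5.56,
     6.37, 6.66, 5.10, 5.80, 6.29, 5.49, 4.93, 6.18, 8.29, 6.84]


def cusum_sign_test(xs, theta):
    """Return (m, position of last zero, path of S_r - min S_i) for test II."""
    s = 0          # running sum S_r of the signs y_i
    s_min = 0      # minimum of S_0, ..., S_r (S_0 = 0)
    path = []
    for xi in xs:
        s += 1 if xi - theta >= 0 else -1
        s_min = min(s_min, s)
        path.append(s - s_min)
    m = max(path)
    # Observation numbers are counted from 1, as in Table 4.
    last_zero = max((r for r, v in enumerate(path, start=1) if v == 0), default=0)
    return m, last_zero, path


m, last_zero, path = cusum_sign_test(x, theta=5.0)
print(m, last_zero)
```

Only the running sum and its running minimum need be stored, which is the sense in which neither the $y$'s nor the individual differences have to be recorded.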


3. GENERAL REMARKS 


In this section we comment briefly on some other possible methods for the problem of § 2 and related 
problems without investigating their properties. 

Another test for a change in one direction of a parameter from a known specified value may be obtained
by analogy with the standard control chart process inspection scheme. The sample is divided into a
number of subsamples of equal size and a statistic calculated from each subsample; the hypothesis of
no change is rejected unless all the statistics fall within a certain range. The properties of this test are
easy to evaluate and the number of subsamples and the permissible interval for the statistics can be
chosen to control the errors. The test is also easy to apply and it is frequently useful in rough work.
However, the temporal ordering of the observations enters only into the division into subsamples,
and it is of interest to examine whether it is advantageous to employ a slightly more complicated test of
the form 'Reject the null hypothesis if any $k$ of the statistics calculated from $m$ consecutive subsamples
fall outside an interval $J$, or if any one falls outside a wider interval $J'$' (cf. Wilkinson, 1951; Tippett,
1931).

The control chart procedure can also provide a test for the two-sided case where the change from the 
known value can be in either direction. A test based on the mean path of a cumulative sum similar to 
test I is a truncated sequential test (Rao, 1950). Another case that needs to be considered is where the 
initial value of the parameter is unknown. 


I wish to thank Dr D. R. Cox for a number of discussions on the subject of this paper, and the Director, 
Mathematical Laboratory, Cambridge, for permission to use the EDSAC. 


REFERENCES 


PAGE, E. S. (1954). Biometrika, 41, 100.

RAO, C. R. (1950). Sankhyā, 10, 361.

TIPPETT, L. H. C. (1931). The Methods of Statistics. London: Williams and Norgate.

WILKINSON, B. (1951). Psychol. Bull. 48, 156.


A paradox in statistical estimation 


By ALAN STUART 
Division of Research Techniques, London School of Economics 


1. Sundrum (1954) has recently shown to be incorrect the intuitive idea that the more efficient of
two estimators of a parameter necessarily provides the more powerful test of a hypothesis concerning
that parameter. This note discusses a similar paradox which arises in a problem concerned purely with
estimation: given an estimator $u$ of a parameter $\theta$ in a multiparameter distribution, one does not
necessarily improve its efficiency by substituting true parameter values into $u$ to replace estimators of
them.

2. We shall consider the case where there is only one other parameter, say $\mu$, and where we have
a consistent estimator of $\theta$
$$t = f(s, \mu), \tag{1}$$
where $s$ is a function of the $n$ observations only. If $\mu$ is unknown, we reduce $t$ to a function of the
observations only by substituting for $\mu$ a consistent estimator of it, say $m$, giving
$$u = f(s, m). \tag{2}$$
From (1) we have, to order $n^{-1}$,
$$V(t) = \left(\frac{\partial t}{\partial s}\right)^2 V(s). \tag{3}$$
From (2) we have the corresponding result for a function of two random variables
$$V(u) = \left(\frac{\partial u}{\partial s}\right)^2 V(s) + \left(\frac{\partial u}{\partial m}\right)^2 V(m) + 2\,\frac{\partial u}{\partial s}\,\frac{\partial u}{\partial m}\,C(s, m). \tag{4}$$
Since all the derivatives in (3) and (4) are to be taken at the true parameter point $(\theta, \mu)$, the first term
on the right of (4) is equal to (3). Thus
$$V(u) - V(t) = \left(\frac{\partial u}{\partial m}\right)^2 V(m) + 2\,\frac{\partial u}{\partial s}\,\frac{\partial u}{\partial m}\,C(s, m). \tag{5}$$
(5) is not generally positive, although it must be so if $s$ and $m$ are uncorrelated. In general, their
correlation must be taken into account before the effect of substituting parameters into $u$ can be
assessed. It is the correlation term in (5) which resolves the paradox.







3. An example in which (5) is negative is provided by the estimation of the correlation parameter $\rho$
in a bivariate normal distribution. A consistent estimator is provided by the sample correlation
coefficient $r$, which has a large-sample variance,
$$V(r) = \frac{1}{n}(1 - \rho^2)^2. \tag{6}$$
Although $r$ is the maximum-likelihood estimator of $\rho$ when all five population parameters are being
simultaneously estimated, it is not an efficient estimator when the population means and variances
are known, which is the case we shall consider. One might therefore expect to improve its efficiency by
substituting the known population means and variances for their sample correspondents. The new
estimator thus obtained is
$$r' = \frac{\dfrac{1}{n}\sum_i (x_i - \mu_1)(y_i - \mu_2)}{\sigma_1 \sigma_2}. \tag{7}$$
We find
$$E(r') = \rho, \tag{8}$$
$$\begin{aligned}
n^2 \sigma_1^2 \sigma_2^2\, E\{(r')^2\} &= E\Big\{\sum_i (x_i - \mu_1)^2 (y_i - \mu_2)^2 + \sum_{i \ne j} (x_i - \mu_1)(y_i - \mu_2)(x_j - \mu_1)(y_j - \mu_2)\Big\} \\
&= n\mu_{22} + n(n-1)\rho^2 \sigma_1^2 \sigma_2^2.
\end{aligned} \tag{9}$$
Since $\mu_{22} = (1 + 2\rho^2)\,\sigma_1^2 \sigma_2^2$, we have from (8) and (9)
$$V(r') = \frac{1}{n}(1 + \rho^2), \tag{10}$$
and comparing (10) with (6) we see that, far from improving the accuracy of estimation, the substitution
of true parameter values for random variables in $r$ has multiplied efficiency by a factor $(1 - \rho^2)^2/(1 + \rho^2)$, which
tends to zero for large $\rho^2$, and is 1 only when $\rho^2 = 0$.
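4. This paradoxical effect can be examined through (5) by restricting our attention to the case when
both population means are zero and both variances equal to $\sigma^2$, which remains unknown. Our estimator
(7) then becomes
$$t = \frac{\dfrac{1}{n}\sum_i x_i y_i}{\sigma^2},$$
and with a common population variance, we should in our correlation coefficient use a pooled estimator
of $\sigma^2$, giving
$$u = \frac{\dfrac{1}{n}\sum_i x_i y_i}{\dfrac{1}{2n}\sum_i (x_i^2 + y_i^2)}.$$
We shall see below that $u$ has the same efficiency as $r$.
In the notation of (1) and (2) we have
$$t = \frac{s}{\mu} \quad\text{and}\quad u = \frac{s}{m},$$
where
$$s = \frac{1}{n}\sum_i x_i y_i, \qquad \mu = \sigma^2 \qquad\text{and}\qquad m = \frac{1}{2n}\sum_i (x_i^2 + y_i^2).$$
We now require
$$V(m) = \frac{\sigma^4}{n}(1 + \rho^2), \qquad C(s, m) = \frac{2\rho\sigma^4}{n}. \tag{11}$$
Also, at the true parameter point,
$$\frac{\partial u}{\partial s} = \frac{1}{\sigma^2}, \qquad \frac{\partial u}{\partial m} = -\frac{\rho}{\sigma^2}. \tag{12}$$
Substituting (11) and (12) into (5), we find
$$\begin{aligned}
V(u) - V(t) &= \frac{\rho^2}{\sigma^4}\cdot\frac{\sigma^4}{n}(1 + \rho^2) + 2\left(\frac{1}{\sigma^2}\right)\left(-\frac{\rho}{\sigma^2}\right)\frac{2\rho\sigma^4}{n} \\
&= \frac{\rho^2}{n}(\rho^2 - 3).
\end{aligned} \tag{13}$$
(13) is negative whenever $\rho^2 \ne 0$, confirming the result on efficiency given above.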


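5. Finally, we may confirm that $u$ is as efficient as $r$. Using the general formula for the large-sample
variance of a ratio estimator,
$$V(u) = \left\{\frac{E(s)}{E(m)}\right\}^2 \left[\frac{V(s)}{\{E(s)\}^2} + \frac{V(m)}{\{E(m)\}^2} - \frac{2\,C(s, m)}{E(s)\,E(m)}\right]. \tag{14}$$
Using (8), (10) and (11) in (14), we obtain
$$\begin{aligned}
V(u) &= \rho^2\left[\frac{1 + \rho^2}{n\rho^2} + \frac{1 + \rho^2}{n} - \frac{4}{n}\right] \\
&= \frac{1}{n}(1 - \rho^2)^2,
\end{aligned} \tag{15}$$
agreeing with (6).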
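
The large-sample results (6), (10) and (15) can be illustrated by a small sampling experiment. This is a sketch only, under assumptions not in the note: zero means and unit variances, $\rho = 0.6$, samples of $n = 100$, and 1000 replications with a fixed seed; the function name `simulate` is illustrative.

```python
import math
import random
from statistics import variance


def simulate(rho, n, reps, seed=0):
    """Return n times the sampling variances of r, r' and u (in that order)."""
    rng = random.Random(seed)
    c = math.sqrt(1.0 - rho * rho)
    r_vals, rp_vals, u_vals = [], [], []
    for _ in range(reps):
        xs, ys = [], []
        for _ in range(n):
            z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
            xs.append(z1)
            ys.append(rho * z1 + c * z2)       # correlation rho with xs
        sxy = sum(a * b for a, b in zip(xs, ys))
        sxx = sum(a * a for a in xs)
        syy = sum(b * b for b in ys)
        mx, my = sum(xs) / n, sum(ys) / n
        # r: ordinary sample correlation, all five parameters estimated.
        r = (sxy - n * mx * my) / math.sqrt((sxx - n * mx * mx) * (syy - n * my * my))
        r_vals.append(r)
        # r': equation (7) with the true means (0) and variances (1) inserted.
        rp_vals.append(sxy / n)
        # u: true means inserted, common variance estimated by the pooled m.
        u_vals.append(sxy / (0.5 * (sxx + syy)))
    return tuple(n * variance(v) for v in (r_vals, rp_vals, u_vals))


rho = 0.6
nv_r, nv_rp, nv_u = simulate(rho, n=100, reps=1000, seed=1)
print("n V(r)  ~", round(nv_r, 3), " theory", round((1 - rho**2) ** 2, 3))
print("n V(r') ~", round(nv_rp, 3), " theory", round(1 + rho**2, 3))
print("n V(u)  ~", round(nv_u, 3), " theory", round((1 - rho**2) ** 2, 3))
```

With these settings the simulated variance of $r'$ should be roughly three times that of $r$ or $u$, in line with the factor $(1 + \rho^2)/(1 - \rho^2)^2$.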
REFERENCE 


SUNDRUM, R. M. (1954). On the relation between estimating efficiency and the power of tests.
Biometrika, 41, 542-4.


Cumulants of a transformed variate 


By G. S. JAMES
University of Leeds 


1. Suppose that $x$ is a variate whose $p$th cumulant, denoted by $\kappa_{px}$, is of order $\nu^{-p+1}$, where $\nu$ is some
'large' number. One consequence of this is that $x$ is approximately normally distributed with a variance
of order $\nu^{-1}$. Thus if $y$ is some well-behaved function of $x$, say $y = f(x)$, then $y$ will also be approximately
normally distributed with mean and variance given by
$$\kappa_{1y} = E\,y = f(\mu)\,[1 + O(\nu^{-1})], \tag{1}$$
$$\kappa_{2y} = \operatorname{var} y = [f'(\mu)]^2 \operatorname{var} x\,[1 + O(\nu^{-1})], \tag{2}$$
where $\mu$ is the mean value of $x$. These are of orders $\nu^{0}$ and $\nu^{-1}$ respectively.

It is by no means obvious, however, that the $p$th cumulant of $y$, denoted by $\kappa_{py}$, is of order $\nu^{-p+1}$ for
general values of $p$, and not merely for $p = 1$ and 2. The object of this note is to show that this is true,
at any rate formally. (The result will be true in reality, as well as formally, under the same sort of
conditions for which (1) and (2) are really true.)

Suppose that $f(x)$ can be expanded formally in a Taylor series round $\mu$; there is no loss of generality
in taking $\mu = 0$, so that this series is
$$y = c_0 + c_1 x + c_2 x^2 + \ldots, \tag{3}$$
where we suppose that $c_0, c_1, c_2, \ldots$ do not depend upon $\nu$. It is quite easy to show, by formal expansion
starting from (3), that the $p$th moment of $y$ about its mean is of order $\nu^{-[\frac{1}{2}(p+1)]}$, where $[k]$ denotes the
integral part of $k$. But on conversion to cumulants the higher order terms always seem to cancel, to
such an extent that $\kappa_{py}$ is of order $\nu^{-p+1}$. We now give a proof of this result.


THEOREM 1. If a variate $x$ possesses cumulants $\kappa_{px}$ of all orders, and if $\kappa_{px} = O(\nu^{-p+1})$ $(p = 1, 2, \ldots)$,
and the cumulants of $y = f(x)$ are calculated on the basis of a (possibly formal) Taylor expansion (3),
where the $c_r$ do not depend upon $\nu$, then $\kappa_{py} = O(\nu^{-p+1})$ $(p = 1, 2, \ldots)$.

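Without loss of generality we may assume that $c_0 = 0$. We may then write $y = \sum z_r$, where $z_r = c_r x^r$
$(r = 1, 2, \ldots)$. Writing $\mu_{(z)}(r_1 \ldots r_p)$ for the moment $E(z_{r_1} \ldots z_{r_p})$ (some of the $r_i$ may be equal) and $\kappa_{(z)}(r_1 \ldots r_p)$
for the corresponding cumulant, we have, by the properties of moment-generating and cumulant-generating
functions,
$$\begin{aligned}
\frac{\kappa_{1y} t}{1!} + \frac{\kappa_{2y} t^2}{2!} + \ldots &= \log\left\{1 + \frac{\mu'_{1y} t}{1!} + \frac{\mu'_{2y} t^2}{2!} + \ldots\right\} \\
&= \log\left\{1 + \sum_r t_r\,\mu_{(z)}(r) + \frac{1}{2!}\sum_{r,s} t_r t_s\,\mu_{(z)}(rs) + \ldots\right\} \\
&= \sum_r t_r\,\kappa_{(z)}(r) + \frac{1}{2!}\sum_{r,s} t_r t_s\,\kappa_{(z)}(rs) + \ldots,
\end{aligned} \tag{4}$$
where $t_1 = t_2 = \ldots = t$ and the summations are over $r, s, \ldots = 1, 2, \ldots$. Hence
$$\kappa_{py} = \sum \kappa_{(z)}(r_1 \ldots r_p), \tag{5}$$
summed for $r_1, \ldots, r_p = 1, 2, \ldots$.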

Thus to prove the theorem it is sufficient to show that $\kappa_{(z)}(r_1 \ldots r_p)$ is of order $\nu^{-p+1}$ or lower. To
demonstrate this, suppose rather more generally that $x_1, \ldots, x_n$ is a sample of $n$ independent values from the
$x$ distribution, and let $z_r = c_r (\sum x_i)^r$ $(r = 1, 2, \ldots)$. This agrees with the earlier definition of $z_r$ when $n = 1$
(and $x_1 = x$). The $z_r$ can be regarded as statistics of the sample $x_1, \ldots, x_n$, although of course they are
functionally related. Each $z_r$ is a homogeneous polynomial symmetric function of the sample values,
of degree $r$, and the coefficient of $x_1^{a_1} \ldots x_\alpha^{a_\alpha}$ in $z_r$ (with $\sum a_i = r$) is
$$\frac{r!}{a_1! \ldots a_\alpha!}\,B_{(z)}(a_1 \ldots a_\alpha), \tag{6}$$
where
$$B_{(z)}(a_1 \ldots a_\alpha) = c_r. \tag{7}$$
A corresponding result holds for Fisher's statistics $k_r$, but with
$$B_{(k)}(a_1 \ldots a_\alpha) = \frac{(-)^{\alpha-1}(\alpha - 1)!}{n(n-1)\ldots(n-\alpha+1)}. \tag{8}$$
The factors $B_{(z)}$ and $B_{(k)}$ happen not to depend on the complete detail of the partition $(a_1 \ldots a_\alpha)$ of the
number $r$; but I have shown (James, 1955) that Fisher's rules for obtaining the sampling cumulants of
$k$-statistics, described in Kendall's book (Kendall, 1947), may be adapted to any system of statistics
$z_1, z_2, \ldots$ ($z_r$ being a homogeneous polynomial symmetric function of degree $r$ in the observations) by
merely replacing the factors $B_{(k)}$ occurring in the evaluation of the 'pattern functions' by the factors
$B_{(z)}$, where $B_{(z)}$ is the coefficient of $x_1^{a_1} \ldots x_\alpha^{a_\alpha}$ in $z_r$ divided by the appropriate multinomial coefficient.

The rule of Fisher which is of paramount importance for our proof is the one which states that, in
finding cumulants of $k$- (or $z$-) statistics in terms of population cumulants, we are to neglect any array
which splits up into two or more disjoint blocks. Now any array which does not fall into disjoint blocks
may be built up column by column in such a way that at each stage the new column does not form an
isolated block by itself. (For example, the columns of
$$\begin{matrix} \times & \cdot & \times & \cdot \\ \cdot & \times & \times & \times \end{matrix}$$
may be taken in the order 1324, but not in the order 1234.) Thus, if $\alpha_1, \ldots, \alpha_p$ are the numbers of
non-zero elements in the 1st, ..., $p$th columns and $\tau$ is the total number of rows in the array, we have
$$\tau \le \alpha_1 + (\alpha_2 - 1) + \ldots + (\alpha_p - 1) = \sum \alpha_i - p + 1. \tag{9}$$
Now if this array is one of those contributing to the coefficient of $\kappa_{t_1} \ldots \kappa_{t_\tau}$ in $\kappa_{(z)}(r_1 \ldots r_p)$ (so that $\sum t_i = \sum r_i$)
then the corresponding term is of order $\nu^{\sum(-t_i + 1)} = \nu^{-\sum r_i + \tau}$; for the numerical factors and pattern
functions for each separation of the array do not depend upon $\nu$, while each $\kappa_{t_i}$ is of order $\nu^{-t_i+1}$. But (9)
shows that
$$-\sum r_i + \tau \le -\sum (r_i - \alpha_i) - p + 1 \le -p + 1; \tag{10}$$
for $\alpha_i$ is the number of parts in a partition of $r_i$, and so does not exceed $r_i$. Thus this particular term of
$\kappa_{(z)}(r_1 \ldots r_p)$ is of order $\nu^{-p+1}$ or lower. Hence the same is true of $\kappa_{(z)}(r_1 \ldots r_p)$ itself, and finally, by (5),
of $\kappa_{py}$.

2. By the use of multivariate sampling rules more general results than Theorem 1 can be proved.
Perhaps the most general is Theorem 2.


THEOREM 2. If $y^1 = f^1(x^1, \ldots, x^q)$, $y^2 = f^2(x^1, \ldots, x^q)$, ... are functions of the variates* $x^1, \ldots, x^q$,
formally expansible in the forms
$$y^r = c^r + \sum c^r_i x^i + \sum c^r_{ij} x^i x^j + \ldots, \tag{11}$$
and if the $p$th-order cumulant, $\kappa^{i_1 \ldots i_p}$, of $x^{i_1}, \ldots, x^{i_p}$ is of order $\nu^{-p+1}$ for $i_1, \ldots, i_p = 1, \ldots, q$ and $p = 1, 2, \ldots$,
then the same holds for the cumulants, $\kappa^{r_1 \ldots r_p}$, of the $y^r$. (Here we use a slightly modified form of the
'tensor' notation suggested by Kaplan (1952).)

* The superfixes are indices, not exponents.

An outline of the proof is as follows. If $z_1, z_2, \ldots$ denote the quantities $x^1, \ldots, x^q$, $x^1x^1, x^1x^2, \ldots, x^qx^q$,
$x^1x^1x^1, \ldots$ written in some convenient order, then (11) may be rewritten in the form
$$y^r = \sum_s c^r_s z_s, \tag{12}$$
whence we easily derive
$$\kappa^{r_1 \ldots r_p} = \sum_{s_1, \ldots, s_p} c^{r_1}_{s_1} \ldots c^{r_p}_{s_p}\,\kappa_{(z)}(s_1 \ldots s_p), \tag{13}$$
where $\kappa_{(z)}(s_1 \ldots s_p)$ is the mixed cumulant of $z_{s_1}, \ldots, z_{s_p}$. Thus it suffices to show that all the $\kappa_{(z)}(s_1 \ldots s_p)$
are of order $\nu^{-p+1}$ or lower, for $p = 1, 2, \ldots$. The proof is completed in much the same way as before, but
in order to see clearly the application of the results of my other paper the $z$'s should be relabelled as
follows: $z^i = x^i$, $z^{ij} = x^i x^j$, ....

3. An example. Johnson & Welch (1939) have found by direct calculation, using Stirling's asymptotic
expansion of the gamma function, that if $\chi^2$ is the sum of squares of $\nu$ independent standard normal
deviates, then the cumulants of $\chi$, as far as the sixth, are given asymptotically as follows:


Kiy™~ vi, Key™~ +, 
Kgy~ tv-4, Kay~ Per, (14) 
Ksy~ — Fert, Key ~ —Fhv—. 


Now if we write $x = \chi^2/\nu - 1$, $y = \sqrt{(1 + x)} = \chi/\sqrt{\nu}$, then $\kappa_{px}$ is of order $\nu^{-p+1}$, so that Theorem 1 shows
that $\kappa_{py}$ is of order $\nu^{-p+1}$ or lower and $\kappa_{p\chi}$ is of order $\nu^{-\frac{1}{2}p+1}$ or lower. Thus the odd cumulants in (14)
are of the order we should expect, but the even ones (apart from the second) seem to be a whole order
lower than demanded by Theorem 1. I have been unable to prove that $\kappa_{2p,y} = O(\nu^{-2p})$ for general values
of $p$ $(\ge 2)$.


REFERENCES 


JAMES, G. S. (1955). On moments and cumulants of systems of statistics. (Not yet published.)

JOHNSON, N. L. & WELCH, B. L. (1939). On the calculation of the cumulants of the χ-distribution.
Biometrika, 31, 216-18.

KAPLAN, E. L. (1952). Tensor notation and the sampling cumulants of k-statistics. Biometrika, 39,
319-23.

KENDALL, M. G. (1947). The Advanced Theory of Statistics, vol. 1, 3rd ed. London: Charles Griffin 
and Co. 


The likelihood ratio test for Markoff chains 
By I. J. GOOD 


1. The present paper is virtually a footnote to that of Hoel (1954), who, using and acknowledging
the methods of Bartlett (1951), constructed a likelihood ratio test for the order of a Markoff chain. If
$H_\nu$ is the hypothesis that the chain is of order $\nu$, then Hoel tests $H_{\nu-1}$ against (or rather within) $H_\nu$.
Here we make a simple generalization so as to test $H_\mu$ within $H_\nu$, and we relate the work to some previous
work.

2. Let $x_1, x_2, \ldots, x_N$ be a sequence, $S$, of observations, each observation being capable of taking one
of $t$ values denoted conventionally by $1, 2, \ldots, t$. Let $H_\nu$ be the hypothesis that $S$ is a Markoff chain of
order $\nu$ $(\nu = 0, 1, 2, \ldots)$. Note that $H_0$ means that $S$ is a random sequence. We may obtain uniformity
in expression by defining $H_{-1}$ to mean that $S$ is a 'perfectly' random or 'equiprobably' random sequence.
$H_{-1}$ is the only $H_\nu$ that is a simple statistical hypothesis. Clearly $H_\mu$ implies $H_\nu$ whenever $\mu < \nu$.

The likelihood ratio test for composite hypotheses may be expressed in the following form. Let $H$
and $H'$ be hypotheses such that $H$ implies $H'$, so that $H'$ at least is not a simple statistical hypothesis.
Let $E$ be an experimental result and let
$$\lambda = \max P(E \mid H^*) / \max P(E \mid H'^*),$$
the maxima being taken over all simple statistical hypotheses $H^*$, $H'^*$ belonging to $H$ and $H'$ respectively.
$\lambda$ is the likelihood ratio statistic for composite hypotheses and we may say that it tests $H$ within $H'$.
We may call $H$ the null hypothesis whether or not it is simple and (within orthodox statistics) $\lambda$ can be
used for the rejection of $H$.

Hoel (1954) used $\lambda$ for testing $H_{\nu-1}$ within $H_\nu$ $(\nu \ge 1)$, and, with suitable conventions, his result applies
also for $\nu = 0$. Bartlett (1951) used the likelihood ratio statistic for testing completely specified chains
of order $\nu - 1$ within $H_\nu$. Since $H_{-1}$ can be regarded as a completely specified chain of any order, Bartlett's
work included tests for $H_{-1}$ within $H_\nu$ $(\nu = 0, 1, 2, \ldots)$. Similarly, a completely specified chain of order $\mu$
is also one of order $\nu - 1$ (if $\mu < \nu$), so that Bartlett's work applies for testing any completely specified
chain of order $\mu$ within $H_\nu$, where $\mu < \nu$.

We shall here generalize Hoel's work in order to test $H_\mu$ within $H_\nu$ where $\mu < \nu$. Only when $\mu = -1$
does the generalization overlap with Bartlett's results, since the only $H_\mu$ that is completely specified
is $H_{-1}$. (A completely specified statistical hypothesis is the same thing as a simple statistical hypothesis.)

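3. Associated with $S$ is the corresponding cyclic sequence $\bar{S}$ defined by regarding the first element
of $S$ as immediately following its last one. We shall always denote properties of $\bar{S}$ by placing a bar over
the corresponding algebraic symbol relating to $S$.

A sequence of $\nu$ consecutive observations is called a $\nu$-sequence. Let $n_{r_1 \ldots r_\nu}$, or $n_{\mathbf{r}}$ for short, be the
number of $\nu$-sequences in $S$ which are $(r_1, r_2, \ldots, r_\nu) = \mathbf{r}$. Let $\bar{n}_{\mathbf{r}}$ be defined similarly. Let
$$\psi_\nu^2 = \sum_{\mathbf{r}} \left(n_{\mathbf{r}} - \frac{N - \nu + 1}{t^\nu}\right)^2 \bigg/ \frac{N - \nu + 1}{t^\nu} \quad (\nu \ge 1), \tag{1}$$
$$\bar{\psi}_\nu^2 = \sum_{\mathbf{r}} (\bar{n}_{\mathbf{r}} - N t^{-\nu})^2 / N t^{-\nu} \quad (\nu \ge 1), \tag{2}$$
$$\psi_0^2 = \bar{\psi}_0^2 = \psi_{-1}^2 = \bar{\psi}_{-1}^2 = 0, \tag{3}$$
$$K_\nu = 2 \sum_{\mathbf{r}} n_{\mathbf{r}} \log n_{\mathbf{r}} \quad (\nu \ge 1), \tag{4}$$
$$K_0 = 2(N - \nu + 1)\log(N - \nu + 1), \qquad K_{-1} = 0, \tag{5}$$
$$\bar{K}_\nu = 2 \sum_{\mathbf{r}} \bar{n}_{\mathbf{r}} \log \bar{n}_{\mathbf{r}} \quad (\nu \ge 1), \tag{6}$$
$$\bar{K}_0 = 2N \log N, \qquad \bar{K}_{-1} = 0, \tag{7}$$
$$\nabla \psi_\nu^2 = \psi_\nu^2 - \psi_{\nu-1}^2, \text{ etc.}, \tag{8}$$
$$\nabla K_\nu = K_\nu - K_{\nu-1}, \text{ etc.} \tag{9}$$
(The logarithms are to base $e$ and $0 \log 0$ means 0.)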
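It is known that if $H_{-1}$ is true, then $\psi_\nu^2$ and $\bar{\psi}_\nu^2$ do not have asymptotically gamma variate (chi-squared)
distributions, but $\nabla\psi_\nu^2$, $\nabla\bar{\psi}_\nu^2$, $\nabla^2\psi_\nu^2$ and $\nabla^2\bar{\psi}_\nu^2$ do (see Bartlett, 1951; Good, 1953). Moreover (precisely),
$$E\,\bar{\psi}_\nu^2 = t^\nu - 1, \tag{10}$$
while $\nabla\psi_\nu^2$ and $\nabla\bar{\psi}_\nu^2$ have $\nabla t^\nu$ degrees of freedom $(\nu \ge 1)$, and $\nabla^2\psi_\nu^2$ and $\nabla^2\bar{\psi}_\nu^2$ have $\nabla^2 t^\nu$ degrees of freedom
$(\nu \ge 2)$.

Hoel defined $\lambda$ as the ratio of the maximum likelihood given $H_{\nu-1}$ to that given $H_\nu$, and obtained the
asymptotic distribution of $\lambda$, given $H_{\nu-1}$. Let us denote by $\lambda_{\mu,\nu}$ the ratio of the maximum likelihood
given $H_\mu$ to that given $H_\nu$ $(\mu < \nu)$. Then Hoel's $\lambda$ is our $\lambda_{\nu-1,\nu}$ and obviously $\lambda_{\mu,\nu}$ is the product of $\nu - \mu$
of Hoel's $\lambda$'s, namely
$$\lambda_{\mu,\nu} = \lambda_{\mu,\mu+1}\,\lambda_{\mu+1,\mu+2} \ldots \lambda_{\nu-1,\nu}. \tag{11}$$
The expression given by Hoel for $-2\log\lambda_{\nu-1,\nu}$ may be written in the form
$$-2\log\lambda_{\nu-1,\nu} = \nabla^2 K_{\nu+1} \quad (\nu = 1, 2, 3, \ldots). \tag{12}$$
With our conventions this equation is also valid for $\nu = 0$.

Hoel's result can be stated in the form: When $H_{\nu-1}$ is given, $\nabla^2 K_{\nu+1}$ has asymptotically a gamma variate
distribution with $\nabla^2 t^{\nu+1}$ degrees of freedom if $\nu \ge 1$ and we may add that for $\nu = 0$ it has asymptotically
a gamma variate distribution with $t - 1$ degrees of freedom, since it then reduces to the likelihood ratio
test for 'perfect' (equiprobable) randomness against randomness in general. Clearly
$$\begin{aligned}
-2\log\lambda_{\mu,\nu} &= -2\log\lambda_{\mu,\mu+1} - \ldots - 2\log\lambda_{\nu-1,\nu} \\
&= \nabla^2 K_{\mu+2} + \nabla^2 K_{\mu+3} + \ldots + \nabla^2 K_{\nu+1} \\
&= \nabla K_{\nu+1} - \nabla K_{\mu+1} \quad (-1 \le \mu < \nu).
\end{aligned} \tag{13}$$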







If we could assume that the variables $-2\log\lambda_{\nu-1,\nu}$ $(\nu = 0, 1, 2, \ldots)$ were asymptotically independent,
given $H_\mu$, it would follow that $-2\log\lambda_{\mu,\nu}$ has a gamma variate distribution with degrees of freedom
$$\begin{cases}
\nabla t^{\nu+1} - \nabla t^{\mu+1} & (\mu \ge 0), \\
\nabla t^{\nu+1} & (\mu = -1),
\end{cases} \tag{14}$$
and we could use these results for testing $H_\mu$ within $H_\nu$. Unfortunately, the independence does not seem
to be easy to prove. Nevertheless, the result just stated is correct and can be proved by precisely the
same method as Hoel used, except that his suffix $i$ is to be replaced throughout his proof by a sequence
of $\nu - \mu$ suffixes. It is unnecessary to repeat the argument since the modifications are entirely trivial.
The conjecture that the variables $-2\log\lambda_{\nu-1,\nu}$ are asymptotically independent is strengthened by the
knowledge that the above deduction from it is correct.

4. The value of the generalization of Hoel's results is that there may not be a significant distinction
between 'adjacent' hypotheses in the sequence $H_\mu, H_{\mu+1}, \ldots, H_\nu$, yet $H_\mu$ may be clearly rejectable by
the statistic $\nabla K_{\nu+1} - \nabla K_{\mu+1}$. If we put $\mu = 0$ [$\mu = -1$] we have a test for randomness ('perfect'
randomness) within Markovity of order $\nu$. If $\mu = 0$ and $\nu = 1$ we obtain a test that is a special case of the
likelihood ratio test for contingency tables. (See, for example, Wilks, 1946, p. 220, where, however, there
is a minor slip in that the expression given as $\lambda$ is really $1/\lambda$.)
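
The statistic is simple to compute from counts of $\nu$-sequences. The sketch below is an illustration, not the author's procedure: it uses the cyclic definitions (6) and (7), is restricted to $\mu \ge 0$, and the function names (`k_bar`, `lr_statistic`, `degrees_of_freedom`) are hypothetical.

```python
import math
from collections import Counter


def k_bar(seq, v):
    """Cyclic K_v of (6)-(7): twice the sum of n log n over cyclic v-sequences."""
    n = len(seq)
    if v < 0:
        return 0.0
    if v == 0:
        return 2.0 * n * math.log(n)
    counts = Counter(tuple(seq[(i + j) % n] for j in range(v)) for i in range(n))
    return 2.0 * sum(c * math.log(c) for c in counts.values())


def grad_k(seq, v):
    """Backward difference of K_v, as in (9)."""
    return k_bar(seq, v) - k_bar(seq, v - 1)


def lr_statistic(seq, mu, nu):
    """-2 log lambda for testing H_mu within H_nu, equation (13), for 0 <= mu < nu."""
    if not 0 <= mu < nu:
        raise ValueError("this sketch covers 0 <= mu < nu only")
    return grad_k(seq, nu + 1) - grad_k(seq, mu + 1)


def degrees_of_freedom(t, mu, nu):
    """Degrees of freedom from (14) for t possible values of each observation."""
    def grad_t(v):
        return t**v - t ** (v - 1)
    return grad_t(nu + 1) - (grad_t(mu + 1) if mu >= 0 else 0)


# A strictly alternating sequence: plainly not random, though both values are equally frequent.
seq = [1, 2] * 4
stat = lr_statistic(seq, mu=0, nu=1)
dof = degrees_of_freedom(t=2, mu=0, nu=1)
print(stat, dof)
```

For $\mu = 0$, $\nu = 1$ the value agrees with the usual contingency-table form $2\sum n_{ij}\log(n_{ij}N/n_i n_j)$ applied to the cyclic transition counts, which for this sequence is $16 \log 2$ on 1 degree of freedom.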

5. There is clearly a strong analogy between the expressions $\nabla K_{\nu+1} - \nabla K_{\mu+1}$ and $\nabla\psi_{\nu+1}^2 - \nabla\psi_{\mu+1}^2$.
When $H_{-1}$ is true the latter expression also has asymptotically a gamma variate distribution with the
same number of degrees of freedom as the former expression, since (as shown by Good (1953)) the
variables $\nabla^2\psi_\nu^2$ $(\nu = 1, 2, \ldots)$ are asymptotically independent, at any rate when $t$ is a prime number.
This analogy is no coincidence since Vy? is the asymptotic form of VK,., — VK, when H_, is true. 

6. We may also use the cyclic definitions. When $H_\mu$ is true, $\nabla\bar{K}_{\nu+1} - \nabla\bar{K}_{\mu+1}$ has asymptotically a
gamma variate distribution with a number of degrees of freedom given by (14). The cyclic definition is
mathematically simpler than the non-cyclic one and makes the checks
$$\sum_{r_1} \bar{n}_{r_1 \ldots r_\nu} = \bar{n}_{r_2 \ldots r_\nu}, \qquad \sum_{r_\nu} \bar{n}_{r_1 \ldots r_\nu} = \bar{n}_{r_1 \ldots r_{\nu-1}}$$
precise.

7. Not much is known concerning the accuracy of the asymptotic formulae when N is specified. 
When testing H_, the psi-squared statistic has the advantage that its expected value is precisely known, 
but in principle this statistic may well be less powerful than the likelihood ratio. Unless N is very small 
there is probably little to choose between the cyclic and non-cyclic forms of the latter. 


8. Another statistic that would be worth consideration would be $\nabla L_{\nu+1} - \nabla L_{\mu+1}$ (and its cyclic
form), where
$$L_\nu = 2\sum_{\mathbf{r}} \log n_{\mathbf{r}}! \quad (\nu \ge 1), \qquad L_0 = 2\log N!, \qquad L_{-1} = 0. \tag{15}$$
When $\nu = 1$ and $\mu = 0$ this statistic reduces to $\nabla^2 L_2$, which is minus twice the log likelihood of the
$n_{r_1 r_2}$'s, regarded as forming the interior of a contingency table for which the marginal totals are assigned
and independence is assumed.
REFERENCES 
BARTLETT, M. S. (1951). The frequency goodness of fit for probability chains. Proc. Camb. Phil. Soc.
47, 86-95.

GOOD, I. J. (1953). The serial test for sampling numbers and other tests for randomness. Proc. Camb.
Phil. Soc. 49, 276-84.

HOEL, P. G. (1954). A test for Markoff chains. Biometrika, 41, 430-3.

WILKS, S. S. (1946). Mathematical Statistics. Princeton.


Exact forms of some invariants for distributions admitting sufficient statistics 


By V. S. HUZURBAZAR 
University of Poona, India 


1. INTRODUCTION 


The concept of distance in statistics is due to Mahalanobis (1936) who defined the distance between two 
multivariate normal populations. Later Jeffreys (1946, 1948) defined a wider class of invariants of 
probability distributions which may be regarded as providing measures of distance between two
probability distributions, in general. If p and p’ are the density functions of two probability distributions
of a variate x, then the expressions defined by Jeffreys 


\[ I_m = \int \left| p'^{1/m} - p^{1/m} \right|^m \, dx, \tag{1} \]

and

\[ J = \int (p' - p) \log\frac{p'}{p} \, dx, \tag{2} \]


are all positive definite and invariant for all non-singular transformations of the variate and the parameters.
These expressions, therefore, provide measures of distance between two distributions. For
discrete distributions p and p’ are simply the probabilities that x takes a particular value. Extension
to multivariate distributions follows immediately.

An important case is the distance between two distributions having the same mathematical form but 
with different sets of values of the parameters. If the corresponding parameters in the two sets differ by 
infinitesimals, we get the differential forms of the invariants $I_m$ and $J$. The invariants $I_2$ and $J$ have
interesting mathematical properties and have been used by Jeffreys in stating the prior probabilities of
parameters in his theory of estimation and tests of significance. Though these invariants have not yet
attracted the attention of the frequency theorists of probability, it is likely that they may find applications
in their work also.

Jeffreys has obtained the exact forms of the invariants $I_2$ and $J$ for the univariate and bivariate
normal distributions, the Poisson and the binomial distributions; and for these distributions the exact
forms come out as explicit functions of the parameters of distributions.

The object of this paper is to prove, in general terms, a remarkable property that for all distributions
admitting sufficient statistics the exact forms of $I_m$ ($m$ even) and $J$ come out as explicit functions of the
parameters of distributions. It may be mentioned here that the properties of $I_m$ ($m \neq 2$) have so far
remained uninvestigated, the form of $I_m$ being then very complicated.


2. DISTRIBUTIONS ADMITTING SUFFICIENT STATISTICS


We shall take the most general form of distributions admitting sufficient statistics as given by Koopman 
(1936), 


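\[ f(x, \alpha_i) = \exp\Big\{ \sum_{k=1}^{p} u_k(\alpha_i)\, v_k(x) + A(x) + B(\alpha_i) \Big\}, \tag{3} \]

where, for brevity, $(\alpha_i)$ denotes the set of $p$ parameters $(\alpha_1, \alpha_2, \ldots, \alpha_p)$; $u_k$ and $B$ are functions of the $\alpha_i$,
and $v_k$ and $A$ are functions of $x$. For multivariate distributions $(x)$ is to be replaced by the set of variates
$(x_1, x_2, \ldots, x_q)$.

Since $\int f(x, \alpha_i)\,dx = 1$ for all $\alpha_i$, we have

\[ \int \exp\Big\{ \sum_{k=1}^{p} u_k(\alpha_i)\, v_k(x) + A(x) \Big\}\,dx = \exp\{-B(\alpha_i)\}. \tag{4} \]

Now the $u_k(\alpha_i)$ are $p$ independent functions of the $p$ parameters $\alpha_i$. We can express the $\alpha_i$ inversely as
functions of the $u_k$'s. Then $B(\alpha_i)$ can be expressed in terms of the $u_k$'s as

\[ B(\alpha_i) = b(u_k). \tag{5} \]

Then (4) becomes

\[ \int \exp\Big\{ \sum_{k=1}^{p} u_k(\alpha_i)\, v_k(x) + A(x) \Big\}\,dx = \exp\{-b(u_k)\}. \tag{6} \]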


3. EXACT FORM OF $I_2$


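Let

\[ f(x, \alpha_i) = \exp\Big\{ \sum_{k=1}^{p} u_k(\alpha_i)\, v_k(x) + A(x) + B(\alpha_i) \Big\}, \tag{7} \]

and

\[ f(x, \alpha_i') = \exp\Big\{ \sum_{k=1}^{p} u_k(\alpha_i')\, v_k(x) + A(x) + B(\alpha_i') \Big\}, \tag{8} \]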













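so that $f(x, \alpha_i)$ and $f(x, \alpha_i')$ have the same mathematical form but different sets of values of the parameters.
We have

\[
\begin{aligned}
I_2 &= \int \big\{ f^{1/2}(x, \alpha_i) - f^{1/2}(x, \alpha_i') \big\}^2 \, dx \\
    &= 2 - 2 \int \{ f(x, \alpha_i)\, f(x, \alpha_i') \}^{1/2} \, dx \\
    &= 2 - 2 \int \exp\Big[ \sum_{k=1}^{p} \{ \tfrac12 u_k(\alpha_i) + \tfrac12 u_k(\alpha_i') \}\, v_k(x) + A(x) + \tfrac12 B(\alpha_i) + \tfrac12 B(\alpha_i') \Big] dx \\
    &= 2 - 2 \exp\{ \tfrac12 B(\alpha_i) + \tfrac12 B(\alpha_i') \} \int \exp\Big[ \sum_{k=1}^{p} \{ \tfrac12 u_k(\alpha_i) + \tfrac12 u_k(\alpha_i') \}\, v_k(x) + A(x) \Big] dx.
\end{aligned}
\tag{9}
\]

Writing $\tfrac12 u_k(\alpha_i) + \tfrac12 u_k(\alpha_i')$ for $u_k(\alpha_i)$ in (6), we have

\[ \int \exp\Big[ \sum_{k=1}^{p} \{ \tfrac12 u_k(\alpha_i) + \tfrac12 u_k(\alpha_i') \}\, v_k(x) + A(x) \Big] dx = \exp\big[ -b\{ \tfrac12 u_k(\alpha_i) + \tfrac12 u_k(\alpha_i') \} \big]. \tag{10} \]

From (9) and (10) we have

\[ I_2 = 2 - 2 \exp\big[ \tfrac12 B(\alpha_i) + \tfrac12 B(\alpha_i') - b\{ \tfrac12 u_k(\alpha_i) + \tfrac12 u_k(\alpha_i') \} \big]. \tag{11} \]

The curious point to be noted is that the function $A(x)$ remains unaltered in the integral on the left-
hand side of (10) which enables us to evaluate that integral explicitly in virtue of (6).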


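3·1. Illustrative example

Consider the Type III distribution

\[ f(x, a, p) = \frac{a^p e^{-ax} x^{p-1}}{\Gamma(p)} \qquad (0 < x < \infty). \tag{12} \]

We write

\[ f(x, a, p) = \exp\{ -ax + p \log x - \log x + p \log a - \log \Gamma(p) \}. \tag{13} \]

Here $u_1 = a$, $u_2 = p$, $B = p \log a - \log \Gamma(p)$. Expressing $B$ in terms of $u_1$ and $u_2$,

\[ B = u_2 \log u_1 - \log \Gamma(u_2) = b(u_1, u_2), \]

\[ b\Big( \frac{u_1 + u_1'}{2}, \frac{u_2 + u_2'}{2} \Big) = \Big( \frac{p + p'}{2} \Big) \log\Big( \frac{a + a'}{2} \Big) - \log \Gamma\Big( \frac{p + p'}{2} \Big), \]

\[
\begin{aligned}
I_2 &= 2 - 2 \exp\Big[ \tfrac12 p \log a - \tfrac12 \log \Gamma(p) + \tfrac12 p' \log a' - \tfrac12 \log \Gamma(p') - \Big( \frac{p + p'}{2} \Big) \log\Big( \frac{a + a'}{2} \Big) + \log \Gamma\Big( \frac{p + p'}{2} \Big) \Big] \\
    &= 2 - \frac{2\, \Gamma\big( \tfrac12 (p + p') \big)\, a^{p/2}\, a'^{p'/2}}{\{ \Gamma(p)\, \Gamma(p') \}^{1/2}\, \{ \tfrac12 (a + a') \}^{(p + p')/2}}.
\end{aligned}
\tag{14}
\]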
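The closed form (14) can be checked numerically. The following is a minimal sketch and not part of the original note: the function names, the parameter values and the midpoint-rule integration (with its truncation point `upper` and step count `steps`) are illustrative assumptions. It compares (14) with a direct numerical evaluation of the integral defining $I_2$ for two Type III densities.

```python
import math


def type3_pdf(x, a, p):
    """Type III density a^p e^{-ax} x^{p-1} / Gamma(p) for x > 0, as in (12)."""
    return math.exp(p * math.log(a) - a * x + (p - 1) * math.log(x) - math.lgamma(p))


def i2_closed_form(a, p, a2, p2):
    """I_2 between Type III(a, p) and Type III(a2, p2), using the closed form (14)."""
    s = 0.5 * (p + p2)
    log_ratio = (
        math.lgamma(s)
        + 0.5 * p * math.log(a)
        + 0.5 * p2 * math.log(a2)
        - 0.5 * (math.lgamma(p) + math.lgamma(p2))
        - s * math.log(0.5 * (a + a2))
    )
    return 2.0 - 2.0 * math.exp(log_ratio)


def i2_numeric(a, p, a2, p2, upper=60.0, steps=100_000):
    """Midpoint-rule approximation to the integral of (sqrt f - sqrt f')^2 over (0, upper)."""
    h = upper / steps
    total = 0.0
    for j in range(steps):
        x = (j + 0.5) * h
        d = math.sqrt(type3_pdf(x, a, p)) - math.sqrt(type3_pdf(x, a2, p2))
        total += d * d
    return total * h


if __name__ == "__main__":
    print(i2_closed_form(1.5, 2.0, 1.0, 3.5))
    print(i2_numeric(1.5, 2.0, 1.0, 3.5))
```

The two printed values should agree to within the accuracy of the crude integration rule; the closed form needs only log-gamma evaluations, which is the practical point of (14).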
4. THE EXACT FORM OF J 
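\[
\begin{aligned}
J &= \int \{ f(x, \alpha_i') - f(x, \alpha_i) \} \{ \log f(x, \alpha_i') - \log f(x, \alpha_i) \} \, dx \\
  &= \sum_{k=1}^{p} \{ u_k(\alpha_i') - u_k(\alpha_i) \} \Big\{ \int v_k(x)\, f(x, \alpha_i')\, dx - \int v_k(x)\, f(x, \alpha_i)\, dx \Big\} \\
  &= \sum_{k=1}^{p} \{ u_k(\alpha_i') - u_k(\alpha_i) \} \{ E' v_k(x) - E v_k(x) \},
\end{aligned}
\tag{15}
\]

where $E' v_k(x)$ and $E v_k(x)$ denote the expectations of $v_k(x)$ when the parameters are $\alpha_i'$ and $\alpha_i$ respectively.
For brevity write

\[ E_k' = E' v_k(x), \qquad E_k = E v_k(x). \]

Then

\[ J = \sum_{k=1}^{p} \{ u_k(\alpha_i') - u_k(\alpha_i) \} (E_k' - E_k). \tag{16} \]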


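$E_k'$ and $E_k$ can be obtained as follows. We have

\[ \frac{\partial}{\partial \alpha_r} \log f(x, \alpha_i) = \sum_{k=1}^{p} \frac{\partial u_k}{\partial \alpha_r}\, v_k(x) + \frac{\partial B}{\partial \alpha_r}. \]

Since

\[ E\Big\{ \frac{\partial}{\partial \alpha_r} \log f(x, \alpha_i) \Big\} = 0, \]

we have

\[ \sum_{k=1}^{p} \frac{\partial u_k}{\partial \alpha_r}\, E_k + \frac{\partial B}{\partial \alpha_r} = 0. \tag{17} \]

Setting $r = 1, 2, \ldots, p$ in (17), we have $p$ simultaneous linear equations to determine the $E_k$. Then $E_k'$ are
obtained by writing $\alpha_i'$ for $\alpha_i$ in $E_k$. Thus (15) and (17) enable us to express $J$ explicitly in terms of the
parameters.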
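As an illustration that is not in the original note, equations (17) can be solved for the Type III distribution of the illustrative example above. The assumptions are the identifications $\alpha_1 = a$, $\alpha_2 = p$, $u_1 = a$, $v_1(x) = -x$, $u_2 = p$, $v_2(x) = \log x$, $B = p \log a - \log \Gamma(p)$, and the symbol $\psi$ for the digamma function $\Gamma'/\Gamma$, which is introduced here. Setting $r = 1, 2$ in (17) gives

\[ E_1 + \frac{p}{a} = 0, \qquad E_2 + \log a - \psi(p) = 0, \]

so that $E_1 = -p/a$ and $E_2 = \psi(p) - \log a$, and (16) becomes

\[ J = (a' - a) \Big( \frac{p}{a} - \frac{p'}{a'} \Big) + (p' - p) \Big\{ \psi(p') - \psi(p) - \log \frac{a'}{a} \Big\}. \]

When $p' = p$ this reduces to $J = p\,(a' - a)^2 / (a a')$, which is non-negative as a measure of distance should be.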


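The above formulae are greatly simplified if we take the $u_k$ as parameters and express $B(\alpha_i)$ in terms
of them. Then (17) becomes simply

\[ E_k + \frac{\partial B}{\partial u_k} = 0, \quad \text{so that} \quad E_k = -\frac{\partial B}{\partial u_k}. \]

From (16),

\[ J = \sum_{k=1}^{p} (u_k' - u_k) \Big( -\frac{\partial B'}{\partial u_k'} + \frac{\partial B}{\partial u_k} \Big). \tag{18} \]

Writing $u_k' - u_k = du_k$, the differential form of $J$ is

\[ J = -\sum_{k=1}^{p} \sum_{l=1}^{p} \frac{\partial^2 B}{\partial u_k\, \partial u_l}\, du_k\, du_l. \tag{19} \]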
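Formula (18) can be tried on one of the distributions mentioned in the Introduction. The sketch below is an illustration rather than part of the original note, and it assumes the Poisson distribution written in the form (3) with $u = \log \lambda$, $v(x) = x$, $A(x) = -\log x!$ and $B = -\lambda = -e^{u}$; the function names and the truncation of the infinite sum at `terms` are likewise assumptions. It compares (18) with a direct summation of the definition (2).

```python
import math


def poisson_log_pmf(x, lam):
    """log f(x) = x*log(lam) - log(x!) - lam, i.e. u = log(lam), v = x, B = -lam."""
    return x * math.log(lam) - math.lgamma(x + 1) - lam


def j_direct(lam, lam2, terms=150):
    """J from its definition (2): sum over x of (f' - f) * (log f' - log f)."""
    total = 0.0
    for x in range(terms):
        lf = poisson_log_pmf(x, lam)
        lf2 = poisson_log_pmf(x, lam2)
        total += (math.exp(lf2) - math.exp(lf)) * (lf2 - lf)
    return total


def j_from_18(lam, lam2):
    """J from (18) with the single natural parameter u = log(lam) and B = -exp(u)."""
    u, u2 = math.log(lam), math.log(lam2)
    dB_du = -math.exp(u)      # derivative of B at u
    dB2_du2 = -math.exp(u2)   # derivative of B at u'
    return (u2 - u) * (-dB2_du2 + dB_du)


if __name__ == "__main__":
    print(j_direct(2.0, 5.0))
    print(j_from_18(2.0, 5.0))
```

Under these assumptions (18) gives $J = (\lambda' - \lambda) \log(\lambda'/\lambda)$, and the differential form (19) reduces to $e^{u}\,du^2 = d\lambda^2 / \lambda$.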


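5. THE EXACT FORM OF $I_m$ ($m$ EVEN)

For brevity write

\[ f(x, \alpha_i) = f, \quad f(x, \alpha_i') = f', \quad u_k(\alpha_i) = u_k, \quad B(\alpha_i') = B', \quad \text{etc.} \]

Now

\[
\begin{aligned}
I_m &= \int \big| f'^{1/m} - f^{1/m} \big|^m \, dx \\
    &= \int \big( f^{1/m} - f'^{1/m} \big)^m \, dx \quad \text{(since $m$ is even)} \\
    &= \int \Big\{ \sum_{r=0}^{m} (-1)^r \binom{m}{r} f^{(m-r)/m} f'^{r/m} \Big\} \, dx \\
    &= \sum_{r=0}^{m} (-1)^r \binom{m}{r} A_r,
\end{aligned}
\tag{20}
\]

where

\[
\begin{aligned}
A_r &= \int f^{(m-r)/m} f'^{r/m} \, dx \\
    &= \int \exp\Big\{ \sum_{k=1}^{p} \Big( \frac{m-r}{m} u_k + \frac{r}{m} u_k' \Big) v_k(x) + A(x) + \frac{m-r}{m} B + \frac{r}{m} B' \Big\} \, dx \\
    &= \exp\Big( \frac{m-r}{m} B + \frac{r}{m} B' \Big) \int \exp\Big\{ \sum_{k=1}^{p} \Big( \frac{m-r}{m} u_k + \frac{r}{m} u_k' \Big) v_k(x) + A(x) \Big\} \, dx.
\end{aligned}
\tag{21}
\]

Writing $\dfrac{m-r}{m} u_k + \dfrac{r}{m} u_k'$ for $u_k$ in (6), we have

\[ \int \exp\Big\{ \sum_{k=1}^{p} \Big( \frac{m-r}{m} u_k + \frac{r}{m} u_k' \Big) v_k(x) + A(x) \Big\} \, dx = \exp\Big\{ -b\Big( \frac{m-r}{m} u_k + \frac{r}{m} u_k' \Big) \Big\}. \tag{22} \]

From (21) and (22),

\[ A_r = \exp\Big\{ \frac{m-r}{m} B + \frac{r}{m} B' - b\Big( \frac{m-r}{m} u_k + \frac{r}{m} u_k' \Big) \Big\}. \]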










Then from (20),

\[ I_m = \sum_{r=0}^{m} (-1)^r \binom{m}{r} \exp\Big\{ \frac{m-r}{m} B + \frac{r}{m} B' - b\Big( \frac{m-r}{m} u_k + \frac{r}{m} u_k' \Big) \Big\}. \tag{23} \]

Curiously again, the function A(x) remains unaltered in the integral in (22), which enables us to 
evaluate that integral explicitly in virtue of (6). 
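A short numerical check of (23) may be helpful. The sketch below is not from the original note: it assumes, purely for illustration, the normal distribution with unit variance and mean $\mu$, for which $u = \mu$, $v(x) = x$, $A(x) = -\tfrac12 x^2 - \tfrac12 \log 2\pi$ and $B = b(u) = -\tfrac12 u^2$. The function names, the integration range and the midpoint rule are also assumptions.

```python
import math


def b(u):
    """B expressed in the natural parameter u = mu for the N(mu, 1) family."""
    return -0.5 * u * u


def i_m_closed_form(mu, mu2, m):
    """I_m from (23): alternating binomial sum of the terms A_r."""
    total = 0.0
    for r in range(m + 1):
        w = r / m
        a_r = math.exp((1 - w) * b(mu) + w * b(mu2) - b((1 - w) * mu + w * mu2))
        total += (-1) ** r * math.comb(m, r) * a_r
    return total


def normal_pdf(x, mu):
    """Density of N(mu, 1)."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)


def i_m_numeric(mu, mu2, m, lo=-20.0, hi=20.0, steps=80_000):
    """Midpoint-rule approximation to the integral of |f'^(1/m) - f^(1/m)|^m."""
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps):
        x = lo + (j + 0.5) * h
        total += abs(normal_pdf(x, mu2) ** (1 / m) - normal_pdf(x, mu) ** (1 / m)) ** m
    return total * h


if __name__ == "__main__":
    print(i_m_closed_form(0.0, 1.5, 4))
    print(i_m_numeric(0.0, 1.5, 4))
```

For this family each term simplifies to $A_r = \exp\{ -\tfrac12 (r/m)(1 - r/m)(\mu - \mu')^2 \}$, and with $m = 2$ the sum reduces to $2 - 2\exp\{ -(\mu - \mu')^2/8 \}$, in agreement with (11).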


REFERENCES 


JEFFREYS, H. (1946). An invariant form for the prior probability in estimation problems. Proc. Roy. 
Soc. A, 186, 453. 

JEFFREYS, H. (1948). Theory of Probability, 2nd ed. Oxford: Clarendon Press. 

KOOPMAN, B. O. (1936). On distributions admitting a sufficient statistic. Trans. Amer. Math. Soc.
39, 399.

MAHALANOBIS, P. C. (1936). On the generalized distance in statistics. Proc. Nat. Inst. Sci. Ind. 12, 49.




REVIEWS 


Demand Analysis. A Study in Econometrics. By HERMAN WOLD, in association with
LARS JUREEN. New York: Wiley and Sons (Stockholm: Almqvist and Wiksell).
1953. Pp. xvi+358. 56s.


Originating in a study of consumer demand for food in Sweden undertaken in 1938 by Prof. Wold and 
Mr Jureen, this book sets out to present a self-contained account of both methods and results. It 
includes a report on the empirical findings, based on family budget surveys and market statistics, a 
section giving a survey of methods in non-technical language, and sections dealing in more concentrated 
and technical manner with some aspects of economic theory, stochastic processes and regression 
analysis. 

The theoretical sections are of considerable incidental interest for econometrics generally, although 
the developments, on an abstract level, are occasionally only remotely relevant to the reported 
empirical study. 

The author states that one of his principal aims was to justify the use of the traditional method of 
least-squares regression, via the use of economic models of so-called recursive type. His main argument 
rests on the notion of causality in consumer demand situations. With this as a guide to the method 
of application, he obtains, among others, the result that the regression method for the estimation of 
recursive systems is unbiased and consistent on the assumption of non-correlation between the 
explanatory variables and the disturbances, irrespective of whether there is inter-correlation between 
these disturbances. A generalization of the classical formula is derived for the standard error of a 
regression coefficient which allows for intercorrelation. 

The wide scope of recursive models is indicated by a proof that any set of time series has a formal 
representation in terms of a recursive system. 

However the causality argument is regarded (and it is certainly not a straightforward one), the 
regression method would still appear on Wold’s showing to compare favourably, in demand analysis 
at least, with other methods which have been proposed. 

The book is very fully annotated; there is an extensive bibliography (although several of the dated 
references in the text are omitted), and important sets of exercises accompany the theoretical parts. 
Based on years of research into the foundations of the subject, this is a very thorough work of scholarship. 


F. G. FOSTER 


A Study in the Analysis of Stationary Time-series. 2nd edition. By HERMAN WOLD,
with an Appendix by PETER WHITTLE. Stockholm: Almqvist and Wiksell. 1954.
Pp. viii+ 236. 42s. 


As far as the main text is concerned, the present edition differs little from the first, published in 1938, 
the only change of major importance being the replacement of appendices A and B in the first edition 
by two new ones. 

In the first appendix, Prof. Wold makes short but useful comments on certain parts of the text, 
mainly in the light of recent work. Appendix 2, comprising 32 pages, is written by Dr P. Whittle and 
is devoted to a survey of recent advances in the subject. 

The first three chapters of the book dealing with Prof. Wold’s fundamental contributions to the 
theory of discrete stationary schemes are by now familiar to most statisticians and need no further 
comment. Chapter IV, on the application of stationary schemes to experimental data, has not been
as successful in withstanding the passage of time, and the discussion tends to be outmoded in view of 
the advances made in the testing of specific hypotheses and in the problem of estimation. 

Chapter 1 of Appendix 2 is devoted to the contributions of Cramér, Kolmogoroff and Wiener to 
spectral theory and Chapter 2 to a synopsis of Whittle’s own work in the theory of inference, together 
with some miscellaneous results in distribution theory and periodogram analysis. To attempt even a 
cursory review of the field in so short a space is not practicable, and the result is a one-sided exposition 
of the subject. Even so, the treatment is interesting and some new results are introduced. 













The emphasis laid throughout on the mathematical and structural, rather than the statistical aspects 
of time-series analysis, is likely to make the book of greater utility to the mathematical statistician than 
to one who is interested in the applications to economics, physics, meteorology, etc. The latter is also 
likely to view with extreme pessimism the warnings made at intervals throughout the book about 
the reliability of the information provided by the analysis. Briefly stated, the argument is that signifi- 
cance problems are complicated by the fact that one is determining the whole structure of a series from 
a realization or sample and also that the attachment of quantitative significance to the sample func- 
tions is not possible since they are conditioned by the size of the statistical masses to which they refer; 
in particular, there is arbitrariness in (a) the time unit of collection of the data, (b) the region in space 
to which this unit refers. A great deal has been done to remove the first objection by realizing the 
futility of testing individual serial correlations, and much progress has been made in discriminating 
between various hypotheses. The evidence up to the present seems to indicate that the problem is 
soluble in terms of the classical theory. 

The second objection is more troublesome but is not so serious as one might be led to expect. As far 
as (a) is concerned, the choice of time unit raises no special difficulty, since in many instances the auto- 
correlation properties of time-series are not in themselves of fundamental importance. Thus whilst 
collecting data at quarterly intervals instead of annually is likely to alter the structural form of the 
series, if it were required to estimate the correlation between two time-series, the ultimate aim would be 
to eliminate the effect of autocorrelation during the course of the analysis. The intrinsic autocorrelative 
properties of a series are likely to be more important in the problem of prediction, but recent work 
seems to suggest that as far as practical applications are concerned much remains to be done in this 
field. 

The difficulties raised by (b) are more serious, but this type of objection is latent in, and tends to 
restrict the conclusions which are drawn in many other problems in statistics. 

The need for further research is obvious, and Prof. Wold’s own work on this problem of quantitative 
significance is sufficient to raise the question that perhaps our assumptions in dealing with observed
series are too stringent.

G. M. JENKINS


An Introduction to Stochastic Processes with special reference to Methods and 
Applications. By M. S. BARTLETT. Cambridge University Press. 1955. Pp.
xiv + 312. 35s. 


Probability problems concerned with the description of changes with time—where we use ‘time’ in the 
operational sense—are as old as the theory itself. The entire development of the subject in the sixteenth 
and seventeenth centuries was due to the zeal and passion of gamblers and, among other queries, the 
probability of ruin was one which ranked importantly. Interest in this type of problem never faded out,
and we find it discussed by one after another of the great probabilists right up to the present day. The 
difference between the moderns and their predecessors is that problems involving the dynamics of change 
are now given a special title—stochastic processes—and are subject to a unified systematic attack. 

This systematic attack, which has its origins chiefly in the research of the Russian school of pro- 
bability and in particular in that of Kolmogoroff, has resulted in a great deal of elegant mathematics 
and a certain amount of useful statistical methods. Its great defect has been that the problem of 
stochastic processes has been lifted beyond the understanding of many statisticians who still continue 
to solve the problems in ad hoc fashion as they arise. 

In 1946 M. S. Bartlett attempted to remedy this state of affairs by publishing in mimeographed form 
the lectures which he had given in the University of North Carolina. Further clarification, in the dis- 
crete case, came with W. Feller’s book Introduction to Probability Theory, and now we again have Prof. 
Bartlett who attempts to avoid too much mathematical abstraction and gives a general survey of
techniques and applications. Topics on pure theory covered by him are random sequences, processes in 
continuous time, limiting stochastic operations and stationary processes. The problem of the random 
walk, Markoff chains, renewal processes, stochastic convergence and spectral analysis are included. 
Under statistical applications we have a chapter on applications of the random walk, queues, population 
growth and epidemic models, a chapter on statistical inference in stochastic processes and a chapter on 
the correlation analysis of time series including harmonic analysis. 

This book is the best which has yet been produced on this specialized topic. It is lucidly written by 
someone who clearly sees the statistical implications of the abstract theory and may be read without 
undue difficulty by any student who has two years of university mathematical training behind him. 
It will undoubtedly be found indispensable by anyone wishing to learn something of a subject which
has engaged the attention of nearly all probabilists in the past decade.

F. N. DAVID







Probability Theory. By M. LOÈVE. New York: D. van Nostrand Company, Inc. 1955.
Pp. xv+501. $12.00. 


This is a book on probability theory written by a mathematician who has himself made distinguished 
contributions to the subject during the past twenty years. Prof. Loève divides his book into five parts.
In the introductory part we have what might be called the intuitive approach to probability which is 
very often all that the statistician needs. In spite of the fact that the topics are not new they are set 
out in clear and pleasing fashion, and the corollaries to the various theorems proved, particularly in 
the laws of large numbers, will be fresh to those who are not thoroughly familiar with the French and 
Russian literature. The notion of chain dependence is also introduced at this stage. All this is intro- 
ductory. 

In Part One proper Prof. Loève sets us sternly to work with 83 pages on ‘Notions of Measure
Theory’. Here we have additive set functions, topological spaces, measurable functions and Lebesgue
integration. The two chapters which compose this part can be read fairly easily by a student who has 
taken a university degree course in mathematics, but the reasoning is rather condensed and may cause 
difficulty to students coming to it for the first time. In Part II, entitled ‘General Concepts and Tools 
of Probability Theory’, we have the development of probability laws, characteristic functions and 
distribution functions, much of which is in essence familiar although not perhaps the rigorous pre- 
sentation. 

With ‘Independence’, which is the topic of Part Three, Prof. Loève gives us a characteristically
thorough thrashing out of convergence in probability and central limit theorems. Little of this has 
appeared explicitly in English, and it is useful to have it all down in a moderately compact form 
(107 pages). From ‘Independence’ it is natural that we should turn to ‘Dependence’, the topic of Part 
Four. Here, apart from the clear and mathematically rigorous presentation, there is little which has 
not been covered by Doob and Bartlett, except perhaps the second-order random functions. 

This book will be excellent for those with a mathematical training who wish to specialize in pro- 
bability theory regarded as a mathematical discipline. It is throughout logically inevitable in its 
presentation and written as rigorously as only a French-trained mathematician can. There can be no 
better beginning for a young mathematical student than to be given this book. It will, however, not 
be helpful in the training of the statistics student. Many a promising statistician has been wrecked on 
the desert island of mathematical rigour, and those trained in modern mathematical techniques often 
find the necessarily heuristic approach to numerical data painful. However, for the class of student 
indicated the book is excellent and all providing. It is unfortunate that it has to be recorded that the
price is $12.

F. N. DAVID


Statistical Inference. By HELEN M. WALKER and JOSEPH LEV. New York: Henry
Holt and Company. 1953. Pp. xi+510. 48s. 


This text-book on statistical method is designed for beginners with little or no mathematical back- 
ground. A good many mathematical formulae are, however, used (albeit always accompanied by 
careful explanation and practice exercises), and a student who has mastered such a text as Prof. 
Walker’s previous book, Mathematics Essential for Elementary Statistics, would find the going easier. 
In choice of subject-matter and general emphasis it is definitely ‘applied’, the application being to 
educational research; this is not, however, revealed by the title.

In scope it is fairly comprehensive. It begins with the general idea of statistical inference and an 
intuitive discussion of hypothesis testing. The concept of probability is introduced, then the binomial 
distribution, with a discussion of the closeness of the normal approximation. By use of a population of 
two classes only, a variety of problems of hypothesis testing and estimation are treated, the account 
including such topics as one- and two-sided tests, critical regions, errors of the first and second types, 
power functions, confidence limits—concepts which gain in clarity for the beginner by definition in this 
simple context. Populations with more than two classes are next studied, and then some of the more 
important concepts regarding distributions on a continuous variable. There follows an account of a 
class project in sampling which is designed to familiarize the reader with the statistics possessing one 
of the four continuous distributions of major importance: the Normal, Chi-square, Student’s and F dis- 
tributions. The method here is to plot the empirically derived distribution, superimpose the mathe- 
matical curve, and then introduce the student to the table of this function in place of its mathematical 
formula. With this basic material assembled, the subjects now treated include: inferences concerning
the mean or difference between two means; inferences concerning variances and standard deviations;
analysis of variance; linear regression and correlation; biserial, point biserial, phi, tetrachoric and rank-order
correlation coefficients and the correlation ratio; the effects of measurement errors and the
reliability coefficient; multiple regression and correlation; analysis of variance with two or more 
variables of classification; analysis of covariance; percentiles; transformation of scales; non-parametric 
methods. The chapter on non-parametric methods was written by Prof. Lincoln Moses. 

The treatment is concrete, and as far as possible basic ideas are explained by means of numerical 
examples. The student is drilled in computing techniques; a rather extensive set of twenty-three tables 
is an integral part of the text for this purpose, detailed instructions in their use being given. (The 
reader may on occasion be somewhat startled at being addressed directly from the text, exhorted to do 
such and such a computation.) A welcome feature of the book for educationalists should be the amount 
of illustrative numerical data actually taken from published work on educational research. 

The book is well referenced, and a list of authors for further reading is placed at the ends of chapters. 
Many exercises are set, with answers given at the back of the book. There is a subject index, an author 
index and a glossary of the mathematical symbols used. 

It is to be expected of a text with which Prof. Walker is associated that the exposition should be 
pedagogically sound, and there is evidence that a great deal of care was taken about the order and 
method of presentation of the material. The result cannot be said to be completely satisfactory. It is, 
for example, open to serious doubt on occasion whether the average student will be able to absorb 
simultaneously details of computing techniques and the basic principle of the statistical method under 
explanation. Occasionally, too, the style becomes so woolly as to present quite inaccurate information.
Thus, on p. 14, two observations are defined to be independent ‘when information about one of them 
provides no clue whatever as to the other’. A reader may well be puzzled by this if he considers that 
in the case of two independent observations from the same (unknown) population, the first observation 
provides an unbiased estimate of the mean of the second. 

Despite some defects, however, research students of education and allied fields should find this text
invaluable, not only as an instructional manual, but also as a general work of reference in statistical
method.

F. G. FOSTER


Outline of Biometry and Statistics. By C. I. BLISS and D. W. CALHOUN. New Haven,
Conn.: Yale Co-operative Corporation. 1954. Pp. 272+xvi. $4.50.


This book will be extremely useful to those who teach biometry, and to statisticians who work in a 
biological field or who wish to have biological examples to illustrate their teaching of statistics. Each 
chapter consists of a number of very concise explanatory notes, together with several illustrative 
worked examples and exercises chosen from a wide range of experimental material. 

The subjects covered include the binomial, negative binomial and Poisson distributions; chi-squared 
and the analysis of contingency tables; the normal distribution with the estimation of its mean and 
variance; the analysis of variance, factorial experiments and simple experimental designs; regression, 
correlation and association; bio-assay. The authors regret that the proposed length of course (about 
200 hours of students’ time) necessarily excludes a few other topics such as covariance, probits, dis- 
criminants, and sequential and other sampling methods. 

The professional statistician will find this book packed with information, such as short-cut formulas, 
precise statements of the limits within which approximate tests can be relied upon, and alternative 
parametric and non-parametric tests. A number of useful tables and charts are included. A long list 
of references enables the reader to consult the original papers whenever necessary, except for work yet 
unpublished. The exposition is almost always extremely lucid, and the reviewer has noticed few errors 
(apart from those already covered by a list of errata). But surely the exact test for 2 x 2 tables was 
independently discovered by Fisher and by Irwin (see Metron, 12, 1935, 1)? 

The book is too condensed for the unassisted student, but one working through the course under a
teacher’s guidance will find it very useful.

CEDRIC A. B. SMITH







Statistics and Mathematics in Biology. Edited by O. KEMPTHORNE, T. A. BANCROFT,
J. W. GOWEN and J. L. LUSH. Ames, Iowa: Iowa State College Press. 1954. Pp.
ix+632. $6.75.


In case the title should mislead, it should be made clear at once that this is not a text-book on the use 
of statistical and mathematical methods in biology. It is a collection of forty-four essays by various 
authors (many of great distinction) on a wide range of topics in which biology and statistics meet. In 
many cases these deal with recent research by the author; some are highly theoretical, others concern 
practical problems involving only simple mathematics. Almost all are lucidly expressed, and can be
read with interest and profit. The subjects discussed include causation, regression, path coefficients and 
multivariate analysis, classification problems, experimental designs, competition between species, 
growth curves, sampling of populations, toxicity tests, bio-assay, taste testing, feeding experiments, 
animal behaviour, gene frequency estimation, the effect of radiation on cells and viruses, genetic and 
breeding problems and the estimation of nucleoproteins in cell division. 

Naturally this is not the sort of book to which one can turn for complete and detailed information 
on any one topic, although the very full bibliography may indicate the source of such information. 
But it is one which it would be useful to have about the laboratory (whether statistical or biological), 
since somewhere it may suggest the appropriate line of approach to a particular problem, or indicate a 
profitable new line of research. 

One minor point may be mentioned. K. R. Nair has pointed out (Bull. Calcutta Statist. Ass. 20, 
1954, 18) that Quenouille’s ‘almost balanced incomplete block designs’ are particular cases of the 
Bose-Nair-Rao ‘partially balanced incomplete blocks’ (Bose & Nair, Sankhya, 4, 1939, 337). 


CEDRIC A. B. SMITH 


Statistical Analysis in Chemistry and the Chemical Industry. By C. A. BENNETT
and N. L. FRANKLIN. New York: John Wiley and Sons, Inc. 1954. Pp. xvi+724. 58s.


This book contains a well-written account of statistical techniques, some of which are not readily 
available elsewhere. After an introductory chapter, general notions of frequency distributions, 
measures of location, spread and association are introduced. There follows a discussion of probability 
theory including some account of moment generating functions, cumulants and k statistics and the 
common sampling distributions derived from the normal law. A short account of the properties of 
gamma and beta functions is followed by a chapter on confidence limits and tests of significance, in- 
cluding some non-parametric methods and a somewhat brief discussion of sequential tests. In 
Chapter 6 there is an account of linear regression, curvilinear regression and discriminant functions 
and of methods for the solution of the linear equations which arise in the application of these techniques. 

Chapter 7, which contains over 150 pages, is concerned with the analysis of variance, and this is 
followed by a discussion of almost equal length on the design of experiments. The book ends with 
chapters on the analysis of counted data, quality control and tests for randomness. 

To those workers in the chemical industry and elsewhere who have been introduced to statistical 
methods via a ‘cookery book’ and who now wish to broaden their knowledge and to learn something 
more of the basis of these methods, this book is recommended. It will be less useful to those who wish
to know how statistics should be used in practice. Some of the examples fail to demonstrate the value
of the method discussed, or the sort of circumstances in which it might be used, and instances occur
where the statistical analysis confuses rather than clarifies the situation. This is particularly true of the
example on p. 395, where the situation is much more readily appreciated from a set of four simple 
graphs than from the rather complicated analysis of variance which is given in this book. 

It is very important that the experimenter should be the master and not the slave of statistical 
methods. In particular, he must be very clear about what it is he really wants to know and should not
allow himself to be conditioned into using statistical ideas and concepts which are not really appropriate 
to his investigation. In particular, he should not allow himself to be overawed by apparently com- 
plicated mathematical machinery. When he feels his problem is not really answered by the application 
of standard techniques he should try to deal with it from first principles. His solution even if it is 
somewhat approximate and unorthodox will usually serve him better than the misapplication of a 
mathematically ‘exact’ technique. In helping to instil these first principles this text-book is valuable; 
the reader should, however, not take too seriously some of the applications which are described. 


G. E. P. BOX 
















A Million Random Digits with 100,000 Normal Deviates. The RAND Corporation.
Illinois, U.S.A.: The Free Press. 1955. Pp. xxv+600. $10.00. 


Since L. H. C. Tippett’s set of 40,000 random numbers was first published in 1927 as Tracts for Com- 
puters, No. xv, there has been a steady demand for random numbers from a wide variety of users. This 
was recognized by the publication of M. G. Kendall’s and B. Babington Smith’s set of 100,000 digits in 
Tracts for Computers, No. xxiv, just prior to the war. Since then the demand has grown rather than 
diminished, and the recent emphasis on Monte Carlo methods has led to the need for an even larger 
supply. This need has now been filled by the publication of this volume of 1,000,000 random digits 
produced with the aid of an electronic roulette wheel by the Rand Corporation. Tests for randomness, 
such as the frequency test, the so-called poker test and the pairs test, have been applied to the numbers 
and give satisfactory results. 
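
As an illustration of the simplest of these checks, a frequency test compares the observed counts of the ten digits with the equal counts expected under randomness, using a χ² statistic on 9 degrees of freedom. The following is a minimal sketch only, not the procedure actually used by the Rand Corporation; the function name is hypothetical and the sample digits come from Python's own pseudo-random generator rather than from the published table.

```python
import random
from collections import Counter

def frequency_chi_square(digits):
    """Chi-square statistic (9 d.f.) for equal frequency of the digits 0-9."""
    counts = Counter(digits)
    expected = len(digits) / 10
    return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(10))

# Illustrative data only: pseudo-random digits, not the published digits.
rng = random.Random(1955)
sample = [rng.randrange(10) for _ in range(50_000)]
print(f"chi-square on 9 d.f.: {frequency_chi_square(sample):.2f}")
```

A value far out in the upper tail of the χ² distribution on 9 degrees of freedom would cast doubt on the equal-frequency hypothesis.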

Half the table has been used to obtain 100,000 random normal deviates by the conversion of five-digit 
random numbers with a table of the cumulative normal distribution function. The deviates obtained 
are tabulated to three decimal places. 


P. G. MOORE 
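
The conversion of five-digit random numbers into normal deviates mentioned in this review amounts to inverting the cumulative normal distribution function. The sketch below shows the idea under an assumption that is not taken from the book: each five-digit number d is mapped to the probability (d + 0.5)/100000, the mid-point of its cell, before inversion. The function name is hypothetical.

```python
from statistics import NormalDist

def digits_to_deviate(five_digits):
    """Map a five-digit random number (0..99999) to a standard normal deviate."""
    if not 0 <= five_digits <= 99999:
        raise ValueError("expected an integer between 0 and 99999")
    # Assumption: use the mid-point of the cell so that p lies strictly in (0, 1).
    p = (five_digits + 0.5) / 100_000
    # Invert the cumulative normal distribution and keep three decimal places.
    return round(NormalDist().inv_cdf(p), 3)

for d in (2500, 50000, 97499):
    print(d, digits_to_deviate(d))
```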


Tablitsi dlya vichisleniya nepolnoi Γ-funktsii veroyatnosti χ² (Tables for the 
calculation of the incomplete Γ-function and the χ²-probability function). 
By E. E. Slutski (Ed. A. N. Kolmogorov). Moscow and Leningrad: Izdatelstvo 
Akademii Nauk SSSR. 1950. Pp. 14+55 pp. tables. 


These tables are intended to facilitate the computation of 

$$P(\chi^2, n) = \frac{1}{2^{\frac{1}{2}n}\,\Gamma(\tfrac{1}{2}n)} \int_{\chi^2}^{\infty} x^{\frac{1}{2}n-1} e^{-\frac{1}{2}x}\,dx$$

and the related incomplete gamma-function. There is a table of P(χ², n) for 

χ²: 0 (0.1) 3.2, 

n: 0 (0.05) 0.2 (0.1) 6.0, 

and 

χ²: 3.2 (0.2) 7.0 (0.5) 10.0 (1.0) 35.0, 

n: 0 (0.1) 0.4 (0.2) 6.0. 
There are also tables of three auxiliary functions designed for special purposes. To facilitate 
interpolation for small values of χ² the function 

$$T(\chi^2, n) = (\tfrac{1}{2}\chi^2)^{-\frac{1}{2}n}\,\bigl(1 - P(\chi^2, n)\bigr)$$

is tabled for 

χ²: 0 (0.05) 0.2 (0.1) 1.0, 

n: 0 (0.05) 0.2 (0.1) 6.0. 
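
The reason the auxiliary function T is convenient near χ² = 0 can be seen numerically: with a = n/2 and x = χ²/2, the quantity 1 − P(χ², n) behaves like a multiple of x^a for small x, and dividing by x^a leaves a smooth power series whose value at x = 0 is 1/Γ(a + 1). The following sketch evaluates T and P from that series. It is an illustration written for this purpose, not the method of calculation used for the tables, and the function names are hypothetical.

```python
import math

def t_function(chi2, n, max_terms=500):
    """T(chi2, n) = (chi2/2)**(-n/2) * (1 - P(chi2, n)), by power series.

    With a = n/2 and x = chi2/2 this equals
    exp(-x) * sum over k >= 0 of x**k / Gamma(a + k + 1).
    """
    a, x = 0.5 * n, 0.5 * chi2
    term = 1.0 / math.gamma(a + 1.0)   # k = 0 term of the series
    total = term
    for k in range(1, max_terms):
        term *= x / (a + k)            # ratio of successive terms
        total += term
        if term < 1e-17 * total:       # remaining terms are negligible
            break
    return math.exp(-x) * total

def p_function(chi2, n):
    """Upper-tail probability P(chi2, n), recovered from T(chi2, n)."""
    x = 0.5 * chi2
    return 1.0 - x ** (0.5 * n) * t_function(chi2, n)

print(p_function(2.0, 2))   # for n = 2, P reduces to exp(-chi2/2)
print(t_function(0.0, 3))   # at chi2 = 0, T equals 1 / Gamma(n/2 + 1)
```

Because P is obtained here by subtraction from 1, the sketch loses relative accuracy far out in the upper tail; it is intended only to show the relation between the two functions.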
For larger values of n the function P(χ², n) is tabled, with argument 

$$t = \sqrt{2\chi^2} - \sqrt{2n}$$

in place of χ², for 

t: −4.0 (0.1) 4.8, 

n: 6.0 (0.5) 11.0 (1.0) 32.0. 
A further table gives values of the function π(t, x) = P(χ², n), where t is as above and x = √(2/n). This 
function is tabled for 

t: −4.5 (0.1) 4.8, 

x: 0 (0.02) 0.22 (0.01) 0.25. 

It is claimed that this small (seven page) table effectively provides for the computation of the 
incomplete gamma-function in the region (n > 102) not covered by the existing tables of K. Pearson. 
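
The two changes of variable used for the larger values of n are simple to invert, which is what allows a short table to cover an unbounded range of n. A minimal sketch of the arithmetic, with hypothetical function names: t = √(2χ²) − √(2n) and x = √(2/n), so that x = 0.25 corresponds to n = 32 and x tends to 0 as n increases without limit.

```python
import math

def t_argument(chi2, n):
    """Tabular argument t = sqrt(2 * chi2) - sqrt(2 * n)."""
    return math.sqrt(2.0 * chi2) - math.sqrt(2.0 * n)

def chi2_from_t(t, n):
    """Invert t_argument: chi2 = (t + sqrt(2n))**2 / 2."""
    root = t + math.sqrt(2.0 * n)
    if root < 0.0:
        raise ValueError("t is too negative for this n")
    return 0.5 * root * root

def n_from_x(x):
    """Invert x = sqrt(2 / n); x must be positive."""
    return 2.0 / (x * x)

# Values of n corresponding to some tabular values of x.
for x in (0.25, 0.22, 0.02):
    print(f"x = {x:.2f}  ->  n = {n_from_x(x):.1f}")
```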

A final table contains coefficients of Everett’s and Newton’s interpolation formulae. Second and 
fourth central differences with respect to χ² (or t), and second central differences with respect to n (or x) 
are printed in the tables. 


The tables are clearly reproduced and there is an introduction setting out their genesis, calculation 
and use, with a number of worked examples. 


N. L. JOHNSON 







OTHER BOOKS RECEIVED 
1. Rank Correlation Methods, 2nd edition. By M. G. Kendall. London: Charles 
Griffin and Co. 1955. Pp. 196. 36s. 


2. Manpower Shortage and the Fall of the Roman Empire. By A. E. R. Boak. 
U.S.A.: University of Michigan Press. 1955. Pp. 169. $4.50. 


3. The Foundations of Statistics. By L. J. Savage. U.S.A.: John Wiley and Sons; 
London: Chapman and Hall. 1954. Pp. xv+294. 48s. 


4. Statistics in Research. By B. Ostle. U.S.A.: Iowa State College. 1954. Pp. 
xiv+487. $6.95. 


5. Annual Epidemiological and Vital Statistics for 1952. Switzerland: World Health 
Organization, Geneva. 1955. Pp. x+533. 50s. 


Publications of the U.S. Department of Commerce, National Bureau of Standards 


6. Tables of Sine and Cosine Integrals for arguments from 10 to 100. Applied 
Mathematics Series, 32. 1954. Pp. xv+187. $2.25. 


7. Tables of the Gamma Function for Complex Argument. Applied Mathematics 
Series, 34. 1954. Pp. xvi+105. $2.00. 


8. Tables of Functions and of Zeros of Functions. Applied Mathematics Series, 
37. 1954. Pp. ix+211. $2.25. 


9. Contributions to the Solution of Systems of Linear Equations and the 
Determination of Eigen-values. Edited by Olga Taussky. Applied Mathematics 
Series, 39. 1954. Pp. 139. $2.00. 


10. Tables of the Error Function and its derivatives. Applied Mathematics Series, 
41. 1954. Pp. xi+302. $3.25. 











